From fedafc4a03d73b87f72d539d9476bdfb46fc7b77 Mon Sep 17 00:00:00 2001 From: <> Date: Sat, 17 Feb 2024 18:51:25 +0000 Subject: [PATCH] Deployed d139926b8 with MkDocs version: 1.5.3 --- SparkContext/index.html | 22 +++++----------------- scheduler/DAGScheduler/index.html | 4 ++-- scheduler/TaskSetManager/index.html | 17 ++++++++++------- search/search_index.json | 2 +- sitemap.xml.gz | Bin 3601 -> 3601 bytes 5 files changed, 18 insertions(+), 27 deletions(-) diff --git a/SparkContext/index.html b/SparkContext/index.html index 6ab62fb21..d8d58bc99 100644 --- a/SparkContext/index.html +++ b/SparkContext/index.html @@ -1,4 +1,4 @@ - SparkContext - The Internals of Spark Core

SparkContext

SparkContext is the entry point to all of the components of Apache Spark (execution engine) and so the heart of a Spark application. In fact, you can consider an application a Spark application only when it uses a SparkContext (directly or indirectly).

Spark context acts as the master of your Spark application

Important

There should be one active SparkContext per JVM and Spark developers should use SparkContext.getOrCreate utility for sharing it (e.g. across threads).

Creating Instance

SparkContext takes the following to be created:

SparkContext is created (directly or indirectly using getOrCreate utility).

While being created, SparkContext sets up core services and establishes a connection to a cluster manager.

Checkpoint Directory

SparkContext defines checkpointDir internal registry for the path to a checkpoint directory.

checkpointDir is undefined (None) when SparkContext is created and is set using setCheckpointDir.

checkpointDir is required for Reliable Checkpointing.

checkpointDir is available using getCheckpointDir.

getCheckpointDir

getCheckpointDir: Option[String]
+ SparkContext - The Internals of Spark Core      

SparkContext

SparkContext is the entry point to all of the components of Apache Spark (execution engine) and so the heart of a Spark application. In fact, you can consider an application a Spark application only when it uses a SparkContext (directly or indirectly).

Spark context acts as the master of your Spark application

Important

There should be one active SparkContext per JVM and Spark developers should use SparkContext.getOrCreate utility for sharing it (e.g. across threads).

Creating Instance

SparkContext takes the following to be created:

SparkContext is created (directly or indirectly using getOrCreate utility).

While being created, SparkContext sets up core services and establishes a connection to a cluster manager.

Checkpoint Directory

SparkContext defines checkpointDir internal registry for the path to a checkpoint directory.

checkpointDir is undefined (None) when SparkContext is created and is set using setCheckpointDir.

checkpointDir is required for Reliable Checkpointing.

checkpointDir is available using getCheckpointDir.
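The following is a minimal sketch of reliable checkpointing with a checkpoint directory (the local path and application name are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

// Set the checkpoint directory (required for Reliable Checkpointing)
sc.setCheckpointDir("/tmp/spark-checkpoints")
assert(sc.getCheckpointDir.isDefined)

val nums = sc.parallelize(1 to 100)
nums.checkpoint()  // mark the RDD for reliable checkpointing
nums.count()       // running a job materializes the checkpoint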

getCheckpointDir

getCheckpointDir: Option[String]
 

getCheckpointDir returns the checkpointDir.

getCheckpointDir is used when:

Submitting MapStage for Execution

submitMapStage[K, V, C](
   dependency: ShuffleDependency[K, V, C]): SimpleFutureAction[MapOutputStatistics]
 

submitMapStage requests the DAGScheduler to submit the given ShuffleDependency for execution (that eventually produces a MapOutputStatistics).

submitMapStage is used when:

  • ShuffleExchangeExec (Spark SQL) unary physical operator is executed

ExecutorMetricsSource

SparkContext creates an ExecutorMetricsSource when created with spark.metrics.executorMetricsSource.enabled enabled.

SparkContext requests the ExecutorMetricsSource to register with the MetricsSystem.

SparkContext uses the ExecutorMetricsSource to create the Heartbeater.
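As a sketch, the property given above can be set explicitly through SparkConf before SparkContext is created (the application name and master below are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// explicitly enable the executor metrics source
val conf = new SparkConf()
  .setAppName("executor-metrics-demo")
  .setMaster("local[*]")
  .set("spark.metrics.executorMetricsSource.enabled", "true")
val sc = new SparkContext(conf)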

Services

ResourceProfileManager

SparkContext creates a ResourceProfileManager when created.

resourceProfileManager

resourceProfileManager: ResourceProfileManager
@@ -88,7 +88,10 @@
   rp: ResourceProfile): Int
 

maxNumConcurrentTasks requests the SchedulerBackend for the maximum number of tasks that can be launched concurrently (with the given ResourceProfile).


maxNumConcurrentTasks is used when:

withScope

withScope[U](
   body: => U): U
-

withScope executes the given body in a new operation scope (RDDOperationScope.withScope) with this SparkContext.

Note

withScope is used for most (if not all) SparkContext API operators.

Logging

Enable ALL logging level for org.apache.spark.SparkContext logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.SparkContext.name = org.apache.spark.SparkContext
+

withScope executes the given body in a new operation scope (RDDOperationScope.withScope) with this SparkContext.

Note

withScope is used for most (if not all) SparkContext API operators.
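As a simplified sketch (adapted from how operators are defined in the Spark sources, not a verbatim copy), an RDD-creating operator wraps its body in withScope so all RDDs it creates are grouped under a single operation scope:

// simplified sketch of a SparkContext operator wrapped in withScope
def parallelize[T: scala.reflect.ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}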

Finding Preferred Locations for RDD Partition

getPreferredLocs(
+  rdd: RDD[_],
+  partition: Int): Seq[TaskLocation]
+

getPreferredLocs requests the DAGScheduler for the preferred locations of the given partition (of the given RDD).

Note

Preferred locations of a RDD partition are also referred to as placement preferences or locality preferences.


getPreferredLocs is used when:

  • CoalescedRDDPartition is requested to localFraction
  • DefaultPartitionCoalescer is requested to currPrefLocs
  • PartitionerAwareUnionRDD is requested to currPrefLocs
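getPreferredLocs itself is internal to Spark; from user code, the public RDD.preferredLocations gives the same kind of placement preferences per partition, as in the sketch below (the HDFS path is an assumption for illustration):

val lines = sc.textFile("hdfs:///data/words.txt")
// print the placement preferences (e.g. HDFS block locations) of the first partitions
lines.partitions.take(3).foreach { p =>
  println(s"partition ${p.index}: ${lines.preferredLocations(p).mkString(", ")}")
}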

Logging

Enable ALL logging level for org.apache.spark.SparkContext logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.SparkContext.name = org.apache.spark.SparkContext
 logger.SparkContext.level = all
 

Refer to Logging.


TaskSetManager

TaskSetManager is a Schedulable that manages scheduling the tasks of a TaskSet.

TaskSetManager

Creating Instance

TaskSetManager takes the following to be created:

TaskSetManager is created when:


While being created, TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks in the taskset.

TaskSetManager prints out the following DEBUG to the logs:

Epoch for [taskSet]: [epoch]
-

TaskSetManager adds the tasks as pending execution (in reverse order from the highest partition to the lowest).

Number of Task Failures

TaskSetManager is given a maxTaskFailures value that determines how many times a single task can fail before the whole TaskSet is aborted.

Master URL Number of Task Failures
local 1
local-with-retries maxFailures
local-cluster spark.task.maxFailures
Cluster Manager spark.task.maxFailures

isBarrier

isBarrier: Boolean
+ TaskSetManager - The Internals of Spark Core      

TaskSetManager

TaskSetManager is a Schedulable that manages scheduling the tasks of a TaskSet.

TaskSetManager

Creating Instance

TaskSetManager takes the following to be created:

TaskSetManager is created when:


While being created, TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks in the taskset.

TaskSetManager prints out the following DEBUG to the logs:

Epoch for [taskSet]: [epoch]
+

TaskSetManager adds the tasks as pending execution (in reverse order from the highest partition to the lowest).

Number of Task Failures

TaskSetManager is given a maxTaskFailures value that determines how many times a single task can fail before the whole TaskSet is aborted.

Master URL Number of Task Failures
local 1
local-with-retries maxFailures
local-cluster spark.task.maxFailures
Cluster Manager spark.task.maxFailures
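As a sketch, the failure tolerance on a cluster manager can be raised with spark.task.maxFailures (the value below is illustrative):

// allow a single task to fail up to 8 times before the whole TaskSet is aborted
val conf = new org.apache.spark.SparkConf()
  .set("spark.task.maxFailures", "8")
// ...pass conf when creating the SparkContext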

isBarrier

isBarrier: Boolean
 

isBarrier is enabled (true) when this TaskSetManager is created for a TaskSet with barrier tasks.


isBarrier is used when:

resourceOffer

resourceOffer(
   execId: String,
   host: String,
   maxLocality: TaskLocality.TaskLocality,
   taskCpus: Int = sched.CPUS_PER_TASK,
   taskResourceAssignments: Map[String, ResourceInformation] = Map.empty): (Option[TaskDescription], Boolean, Int)
-

resourceOffer determines the allowed locality level for the given TaskLocality (unless it is NO_PREF).

resourceOffer dequeues a task (dequeueTask) for the given execId and host at the allowed locality level. That may or may not give a TaskDescription.

In the end, resourceOffer returns the TaskDescription, hasScheduleDelayReject, and the index of the dequeued task (if any).


resourceOffer returns a (None, false, -1) tuple when this TaskSetManager is isZombie or the offer (by the given host or execId) should be ignored (excluded).


resourceOffer is used when:

Locality Wait

getLocalityWait(
+

resourceOffer determines the allowed locality level for the given TaskLocality (unless it is NO_PREF).

resourceOffer dequeues a task (dequeueTask) for the given execId and host at the allowed locality level. That may or may not give a TaskDescription.

In the end, resourceOffer returns the TaskDescription, hasScheduleDelayReject, and the index of the dequeued task (if any).


resourceOffer returns a (None, false, -1) tuple when this TaskSetManager is isZombie or the offer (by the given host or execId) should be ignored (excluded).


resourceOffer is used when:

Locality Wait

getLocalityWait(
   level: TaskLocality.TaskLocality): Long
-

getLocalityWait is 0 when both the legacyLocalityWaitReset and isBarrier flags are enabled.

getLocalityWait determines the value of locality wait based on the given TaskLocality.TaskLocality.

TaskLocality Configuration Property
PROCESS_LOCAL spark.locality.wait.process
NODE_LOCAL spark.locality.wait.node
RACK_LOCAL spark.locality.wait.rack

Unless the value has been determined, getLocalityWait defaults to 0.

Note

NO_PREF and ANY task localities have no locality wait.


getLocalityWait is used when:

spark.driver.maxResultSize

TaskSetManager uses spark.driver.maxResultSize configuration property to check available memory for more task results.

Recomputing Task Locality Preferences

recomputeLocality(): Unit
-

If zombie, recomputeLocality does nothing.

recomputeLocality recomputes myLocalityLevels, localityWaits and currentLocalityIndex internal registries.

recomputeLocality computes locality levels (for scheduled tasks) and saves the result in myLocalityLevels internal registry.

recomputeLocality computes localityWaits by determining the locality wait for every locality level in myLocalityLevels.

recomputeLocality computes currentLocalityIndex by getLocalityIndex with the previous locality level. If the current locality index is higher than the previous, recomputeLocality recalculates currentLocalityIndex.


recomputeLocality is used when:

Zombie

A TaskSetManager is a zombie when all tasks in a taskset have completed successfully (regardless of the number of task attempts), or if the taskset has been aborted.

While in zombie state, a TaskSetManager can launch no new tasks and responds with no TaskDescriptions to resourceOffers.

A TaskSetManager remains in the zombie state until all tasks have finished running, so it can continue to track and account for the running tasks.

Computing Locality Levels (for Scheduled Tasks)

computeValidLocalityLevels(): Array[TaskLocality.TaskLocality]
+

getLocalityWait is 0 when both the legacyLocalityWaitReset and isBarrier flags are enabled.

getLocalityWait determines the value of locality wait based on the given TaskLocality.TaskLocality.

TaskLocality Configuration Property
PROCESS_LOCAL spark.locality.wait.process
NODE_LOCAL spark.locality.wait.node
RACK_LOCAL spark.locality.wait.rack

Unless the value has been determined, getLocalityWait defaults to 0.

Note

NO_PREF and ANY task localities have no locality wait.
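A sketch of tuning the locality waits (the per-level properties fall back to spark.locality.wait when not set; the values below are illustrative):

val conf = new org.apache.spark.SparkConf()
  .set("spark.locality.wait", "3s")        // base wait the per-level waits fall back to
  .set("spark.locality.wait.node", "10s")  // wait longer for NODE_LOCAL slots
  .set("spark.locality.wait.rack", "0s")   // do not wait for RACK_LOCAL slots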


getLocalityWait is used when:

spark.driver.maxResultSize

TaskSetManager uses spark.driver.maxResultSize configuration property to check available memory for more task results.

Recomputing Task Locality Preferences

recomputeLocality(): Unit
+

If zombie, recomputeLocality does nothing.

recomputeLocality recomputes myLocalityLevels, localityWaits and currentLocalityIndex internal registries.

recomputeLocality computes locality levels (for scheduled tasks) and saves the result in myLocalityLevels internal registry.

recomputeLocality computes localityWaits by determining the locality wait for every locality level in myLocalityLevels.

recomputeLocality computes currentLocalityIndex by getLocalityIndex with the previous locality level. If the current locality index is higher than the previous, recomputeLocality recalculates currentLocalityIndex.


recomputeLocality is used when:

Zombie

A TaskSetManager is a zombie when all tasks in a taskset have completed successfully (regardless of the number of task attempts), or if the taskset has been aborted.

While in zombie state, a TaskSetManager can launch no new tasks and responds with no TaskDescriptions to resourceOffers.

A TaskSetManager remains in the zombie state until all tasks have finished running, so it can continue to track and account for the running tasks.

Computing Locality Levels (for Scheduled Tasks)

computeValidLocalityLevels(): Array[TaskLocality.TaskLocality]
 

computeValidLocalityLevels computes valid locality levels for tasks that were registered in corresponding registries per locality level.

Note

TaskLocality is a locality preference of a task and can be the most localized PROCESS_LOCAL, NODE_LOCAL through NO_PREF and RACK_LOCAL to ANY.

For every pending task (in the pendingTasks registry), computeValidLocalityLevels requests the TaskSchedulerImpl for the acceptable TaskLocalities.

computeValidLocalityLevels always registers ANY task locality level.

In the end, computeValidLocalityLevels prints out the following DEBUG message to the logs:

Valid locality levels for [taskSet]: [comma-separated levels]
-

computeValidLocalityLevels is used when:

executorAdded

executorAdded(): Unit
+

computeValidLocalityLevels is used when:

executorAdded

executorAdded(): Unit
 

executorAdded recomputes locality preferences (recomputeLocality).


executorAdded is used when:

prepareLaunchingTask

prepareLaunchingTask(
   execId: String,
   host: String,
@@ -21,7 +21,10 @@
   taskCpus: Int,
   taskResourceAssignments: Map[String, ResourceInformation],
   launchTime: Long): TaskDescription
-
taskResourceAssignments

taskResourceAssignments are the resources that are passed in to resourceOffer.

prepareLaunchingTask...FIXME


prepareLaunchingTask is used when:

Demo

Enable DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl (or org.apache.spark.scheduler.cluster.YarnScheduler for YARN) and org.apache.spark.scheduler.TaskSetManager and execute the following two-stage job to see their low-level inner workings.

A cluster manager is recommended since it gives more task localization choices (with YARN additionally supporting rack localization).

$ ./bin/spark-shell \
+
taskResourceAssignments

taskResourceAssignments are the resources that are passed in to resourceOffer.

prepareLaunchingTask...FIXME


prepareLaunchingTask is used when:

Serialized Task Size Threshold

TaskSetManager object defines the TASK_SIZE_TO_WARN_KIB value as the threshold to warn a user if any stage contains a task with a serialized size greater than 1000 KiB.

DAGScheduler

DAGScheduler can print out the following WARN message to the logs when requested to submitMissingTasks:

Broadcasting large task binary with size [taskBinaryBytes] [siByteSuffix]
+

TaskSetManager

TaskSetManager can print out the following WARN message to the logs when requested to prepareLaunchingTask:

Stage [stageId] contains a task of very large size ([serializedTask] KiB).
+The maximum recommended task size is 1000 KiB.
+

Demo

Enable DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl (or org.apache.spark.scheduler.cluster.YarnScheduler for YARN) and org.apache.spark.scheduler.TaskSetManager and execute the following two-stage job to see their low-level inner workings.

A cluster manager is recommended since it gives more task localization choices (with YARN additionally supporting rack localization).

$ ./bin/spark-shell \
     --master yarn \
     --conf spark.ui.showConsoleProgress=false
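The two-stage job itself is not included in this hunk; a minimal stand-in (an assumption, not necessarily the book's original snippet) is a shuffle-based aggregation, which produces exactly two stages:

// two stages: the map side of the shuffle, then reduceByKey + count
val pairs = sc.parallelize(0 until 1000, numSlices = 100).map(n => (n % 10, n))
pairs.reduceByKey(_ + _).count()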
 
diff --git a/search/search_index.json b/search/search_index.json
index 455ed7a8b..22c3a80ce 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"The Internals of Spark Core (Apache Spark 3.5.0)","text":"

Welcome to The Internals of Spark Core online book! \ud83e\udd19

I'm Jacek Laskowski, a Freelance Data Engineer specializing in Apache Spark (incl. Spark SQL and Spark Structured Streaming), Delta Lake, Databricks, and Apache Kafka (incl. Kafka Streams) with brief forays into a wider data engineering space (e.g., Trino, Dask and dbt, mostly during Warsaw Data Engineering meetups).

I'm very excited to have you here and hope you will enjoy exploring the internals of Spark Core as much as I have.

Flannery O'Connor

I write to discover what I know.

\"The Internals Of\" series

I'm also writing other online books in the \"The Internals Of\" series. Please visit \"The Internals Of\" Online Books home page.

Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let's take a deep dive into Spark Core \ud83d\udd25

Last update: 2024-02-17

"},{"location":"BytesToBytesMap/","title":"BytesToBytesMap","text":"

BytesToBytesMap is a memory consumer that supports spilling.

Spark SQL

BytesToBytesMap is used in Spark SQL only in the following:

  • UnsafeFixedWidthAggregationMap
  • UnsafeHashedRelation
"},{"location":"BytesToBytesMap/#creating-instance","title":"Creating Instance","text":"

BytesToBytesMap takes the following to be created:

  • TaskMemoryManager
  • BlockManager
  • SerializerManager
  • Initial Capacity
  • Load Factor (default: 0.5)
  • Page Size (bytes)

    BytesToBytesMap is created when:

    • UnsafeFixedWidthAggregationMap (Spark SQL) is created
    • UnsafeHashedRelation (Spark SQL) is created
    "},{"location":"BytesToBytesMap/#destructive-mapiterator","title":"Destructive MapIterator
    MapIterator destructiveIterator\n

    BytesToBytesMap defines a reference to a \"destructive\" MapIterator (if ever created for UnsafeFixedWidthAggregationMap (Spark SQL)).

    The destructiveIterator reference is in two states:

    • Undefined (null) initially when BytesToBytesMap is created
    • The MapIterator if created
    ","text":""},{"location":"BytesToBytesMap/#creating-destructive-mapiterator","title":"Creating Destructive MapIterator
    MapIterator destructiveIterator()\n

    destructiveIterator updatePeakMemoryUsed and then creates a MapIterator with the following:

    • numValues for the number of records
    • A new Location
    • Destructive flag enabled (true)

    destructiveIterator is used when:

    • UnsafeFixedWidthAggregationMap (Spark SQL) is created
    ","text":""},{"location":"BytesToBytesMap/#spilling","title":"Spilling
    long spill(\n  long size,\n  MemoryConsumer trigger)\n

    spill is part of the MemoryConsumer abstraction.

    Only when the given MemoryConsumer is not this BytesToBytesMap and the destructive MapIterator has been used, spill requests the destructive MapIterator to spill (the given size bytes).

    spill returns 0 when the trigger is this BytesToBytesMap or there is no destructiveIterator in use. Otherwise, spill returns how many bytes the destructiveIterator managed to release.

    ","text":""},{"location":"BytesToBytesMap/#numvalues","title":"numValues

    numValues registry is 0 after reset.

    numValues is incremented when Location is requested to append

    numValues can never be bigger than maximum capacity of this BytesToBytesMap or growthThreshold.

    ","text":""},{"location":"BytesToBytesMap/#maximum-capacity","title":"Maximum Capacity

    BytesToBytesMap supports up to 1 << 29 keys.

    BytesToBytesMap makes sure that the initialCapacity is not bigger than the maximum capacity when created.

    ","text":""},{"location":"BytesToBytesMap/#allocating-memory","title":"Allocating Memory
    void allocate(\n  int capacity)\n

    allocate...FIXME

    allocate is used when:

    • BytesToBytesMap is created, reset, growAndRehash
    ","text":""},{"location":"BytesToBytesMap/#growing-memory-and-rehashing","title":"Growing Memory And Rehashing
    void growAndRehash()\n

    growAndRehash...FIXME

    growAndRehash is used when:

    • Location is requested to append (a new value for a key)
    ","text":""},{"location":"ConsoleProgressBar/","title":"ConsoleProgressBar","text":"

    ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr. It uses SparkStatusTracker to poll the status of stages periodically and print out active stages with more than one task. It keeps overwriting itself to hold, in one line, at most the first 3 concurrent stages at a time.

    [Stage 0:====>          (316 + 4) / 1000][Stage 1:>                (0 + 0) / 1000][Stage 2:>                (0 + 0) / 1000]]]\n

    The progress includes the stage id, the number of completed, active, and total tasks.

    TIP: ConsoleProgressBar may be useful when you ssh to workers and want to see the progress of active stages.

    ConsoleProgressBar is created when SparkContext is created with spark.ui.showConsoleProgress enabled and the logging level of the SparkContext.md[org.apache.spark.SparkContext] logger is WARN or higher (i.e. fewer messages are printed out and so there is a \"space\" for ConsoleProgressBar)."},{"location":"ConsoleProgressBar/#source-scala","title":"[source, scala]","text":"

    import org.apache.log4j._ Logger.getLogger(\"org.apache.spark.SparkContext\").setLevel(Level.WARN)

    To print the progress nicely ConsoleProgressBar uses COLUMNS environment variable to know the width of the terminal. It assumes 80 columns.

    The progress bar prints out the status after a stage has run for at least 500 milliseconds, every spark-webui-properties.md#spark.ui.consoleProgress.update.interval[spark.ui.consoleProgress.update.interval] milliseconds.

    NOTE: The initial delay of 500 milliseconds before ConsoleProgressBar shows the progress is not configurable.

    See the progress bar in Spark shell with the following:

    "},{"location":"ConsoleProgressBar/#source","title":"[source]","text":"

    $ ./bin/spark-shell --conf spark.ui.showConsoleProgress=true # <1>

    scala> sc.setLogLevel(\"OFF\") // <2>

    import org.apache.log4j._ scala> Logger.getLogger(\"org.apache.spark.SparkContext\").setLevel(Level.WARN) // <3>

    scala> sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count // <4> [Stage 2:> (0 + 4) / 4] [Stage 2:==============> (1 + 3) / 4] [Stage 2:=============================> (2 + 2) / 4] [Stage 2:============================================> (3 + 1) / 4]

    <1> Make sure spark.ui.showConsoleProgress is true. It is by default. <2> Disable (OFF) the root logger (that includes Spark's logger) <3> Make sure org.apache.spark.SparkContext logger is at least WARN. <4> Run a job with 4 tasks with 500ms initial sleep and 200ms sleep chunks to see the progress bar.

    TIP: https://youtu.be/uEmcGo8rwek[Watch the short video] that shows ConsoleProgressBar in action.

    You may want to use the following example to see the progress bar in full glory - all 3 concurrent stages in console (borrowed from https://github.com/apache/spark/pull/3029#issuecomment-63244719[a comment to [SPARK-4017] show progress bar in console #3029]):

    > ./bin/spark-shell\nscala> val a = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)\nscala> val b = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)\nscala> a.union(b).count()\n

    === [[creating-instance]] Creating ConsoleProgressBar Instance

    ConsoleProgressBar requires a SparkContext.md[SparkContext].

    When being created, ConsoleProgressBar reads spark-webui-properties.md#spark.ui.consoleProgress.update.interval[spark.ui.consoleProgress.update.interval] configuration property to set up the update interval and COLUMNS environment variable for the terminal width (or assumes 80 columns).

    ConsoleProgressBar starts the internal refresh progress timer that refreshes and shows the progress.

    NOTE: ConsoleProgressBar is created when SparkContext is created, spark.ui.showConsoleProgress configuration property is enabled, and the logging level of SparkContext.md[org.apache.spark.SparkContext] logger is WARN or higher (i.e. fewer messages are printed out and so there is a \"space\" for ConsoleProgressBar).

    NOTE: Once created, ConsoleProgressBar is available internally as _progressBar.

    === [[finishAll]] finishAll Method

    CAUTION: FIXME

    === [[stop]] stop Method

    "},{"location":"ConsoleProgressBar/#source-scala_1","title":"[source, scala]","text":""},{"location":"ConsoleProgressBar/#stop-unit","title":"stop(): Unit","text":"

    stop cancels (stops) the internal timer.

    NOTE: stop is executed when SparkContext.md#stop[SparkContext stops].

    === [[refresh]] refresh Internal Method

    "},{"location":"ConsoleProgressBar/#source-scala_2","title":"[source, scala]","text":""},{"location":"ConsoleProgressBar/#refresh-unit","title":"refresh(): Unit","text":"

    refresh...FIXME

    NOTE: refresh is used when...FIXME

    "},{"location":"DriverLogger/","title":"DriverLogger","text":"

    DriverLogger runs on the driver (in client deploy mode) to copy driver logs to Hadoop DFS periodically.

    "},{"location":"DriverLogger/#creating-instance","title":"Creating Instance","text":"

    DriverLogger takes the following to be created:

    • SparkConf

      DriverLogger is created using apply utility.

      "},{"location":"DriverLogger/#creating-driverlogger","title":"Creating DriverLogger
      apply(\n  conf: SparkConf): Option[DriverLogger]\n

      apply creates a DriverLogger when the following hold:

      1. spark.driver.log.persistToDfs.enabled configuration property is enabled
      2. The Spark application runs in client deploy mode (and spark.submit.deployMode is client)
      3. spark.driver.log.dfsDir is specified

      apply prints out the following WARN message to the logs with no spark.driver.log.dfsDir specified:

      Driver logs are not persisted because spark.driver.log.dfsDir is not configured\n

      apply\u00a0is used when:

      • SparkContext is created
      ","text":""},{"location":"DriverLogger/#starting-dfsasyncwriter","title":"Starting DfsAsyncWriter
      startSync(\n  hadoopConf: Configuration): Unit\n

      startSync creates and starts a DfsAsyncWriter (with the spark.app.id configuration property).

      startSync\u00a0is used when:

      • SparkContext is requested to postApplicationStart
      ","text":""},{"location":"ExecutorDeadException/","title":"ExecutorDeadException","text":"

      ExecutorDeadException is a SparkException.

      "},{"location":"ExecutorDeadException/#creating-instance","title":"Creating Instance","text":"

      ExecutorDeadException takes the following to be created:

      • Error message

        ExecutorDeadException is created\u00a0when:

        • NettyBlockTransferService is requested to fetch blocks
        "},{"location":"FileCommitProtocol/","title":"FileCommitProtocol","text":"

        FileCommitProtocol is an abstraction of file committers that can setup, commit or abort a Spark job or task (while writing out a pair RDD and partitions).

        FileCommitProtocol is used for RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset transformations (that use SparkHadoopWriter utility to write a key-value RDD out).

        FileCommitProtocol is created using FileCommitProtocol.instantiate utility.

        "},{"location":"FileCommitProtocol/#contract","title":"Contract","text":""},{"location":"FileCommitProtocol/#abortJob","title":"Aborting Job","text":"
        abortJob(\n  jobContext: JobContext): Unit\n

        Aborts a job

        Used when:

        • SparkHadoopWriter utility is used to write a key-value RDD (and writing fails)
        • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query (and writing fails)
        • (Spark SQL) FileBatchWrite is requested to abort
        "},{"location":"FileCommitProtocol/#abortTask","title":"Aborting Task","text":"
        abortTask(\n  taskContext: TaskAttemptContext): Unit\n

        Abort a task

        Used when:

        • SparkHadoopWriter utility is used to write an RDD partition
        • (Spark SQL) FileFormatDataWriter is requested to abort
        "},{"location":"FileCommitProtocol/#commitJob","title":"Committing Job","text":"
        commitJob(\n  jobContext: JobContext,\n  taskCommits: Seq[TaskCommitMessage]): Unit\n

        Commits a job after the writes succeed

        Used when:

        • SparkHadoopWriter utility is used to write a key-value RDD
        • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query
        • (Spark SQL) FileBatchWrite is requested to commit
        "},{"location":"FileCommitProtocol/#commitTask","title":"Committing Task","text":"
        commitTask(\n  taskContext: TaskAttemptContext): TaskCommitMessage\n

        Used when:

        • SparkHadoopWriter utility is used to write an RDD partition
        • (Spark SQL) FileFormatDataWriter is requested to commit
        "},{"location":"FileCommitProtocol/#deleteWithJob","title":"Deleting Path with Job","text":"
        deleteWithJob(\n  fs: FileSystem,\n  path: Path,\n  recursive: Boolean): Boolean\n

        deleteWithJob requests the given Hadoop FileSystem to delete a path directory.

        Used when InsertIntoHadoopFsRelationCommand logical command (Spark SQL) is executed

        "},{"location":"FileCommitProtocol/#newTaskTempFile","title":"New Task Temp File","text":"
        newTaskTempFile(\n  taskContext: TaskAttemptContext,\n  dir: Option[String],\n  spec: FileNameSpec): String\nnewTaskTempFile(\n  taskContext: TaskAttemptContext,\n  dir: Option[String],\n  ext: String): String // @deprecated\n

        Builds a path of a temporary file (for a task to write data to)

        See:

        • HadoopMapReduceCommitProtocol
        • DelayedCommitProtocol (Delta Lake)

        Used when:

        • (Spark SQL) SingleDirectoryDataWriter is requested to write a record out
        • (Spark SQL) BaseDynamicPartitionDataWriter is requested to renewCurrentWriter
        "},{"location":"FileCommitProtocol/#newTaskTempFileAbsPath","title":"newTaskTempFileAbsPath","text":"
        newTaskTempFileAbsPath(\n  taskContext: TaskAttemptContext,\n  absoluteDir: String,\n  ext: String): String\n

        Used when:

        • (Spark SQL) DynamicPartitionDataWriter is requested to write
        "},{"location":"FileCommitProtocol/#onTaskCommit","title":"On Task Committed","text":"
        onTaskCommit(\n  taskCommit: TaskCommitMessage): Unit\n

        Used when:

        • (Spark SQL) FileFormatWriter is requested to write
        "},{"location":"FileCommitProtocol/#setupJob","title":"Setting Up Job","text":"
        setupJob(\n  jobContext: JobContext): Unit\n

        Used when:

        • SparkHadoopWriter utility is used to write an RDD partition (while writing out a key-value RDD)
        • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query
        • (Spark SQL) FileWriteBuilder is requested to buildForBatch
        "},{"location":"FileCommitProtocol/#setupTask","title":"Setting Up Task","text":"
        setupTask(\n  taskContext: TaskAttemptContext): Unit\n

        Sets up the task with the Hadoop TaskAttemptContext

        Used when:

        • SparkHadoopWriter is requested to write an RDD partition (while writing out a key-value RDD)
        • (Spark SQL) FileFormatWriter utility is used to write out a RDD partition (while writing out a result of a structured query)
        • (Spark SQL) FileWriterFactory is requested to createWriter
        "},{"location":"FileCommitProtocol/#implementations","title":"Implementations","text":"
        • HadoopMapReduceCommitProtocol
        • ManifestFileCommitProtocol (qv. Spark Structured Streaming)
        "},{"location":"FileCommitProtocol/#instantiating-filecommitprotocol-committer","title":"Instantiating FileCommitProtocol Committer
        instantiate(\n  className: String,\n  jobId: String,\n  outputPath: String,\n  dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol\n

        instantiate prints out the following DEBUG message to the logs:

        Creating committer [className]; job [jobId]; output=[outputPath]; dynamic=[dynamicPartitionOverwrite]\n

        instantiate tries to find a constructor method that takes three arguments (two of type String and one Boolean) for the given jobId, outputPath and dynamicPartitionOverwrite flag. If found, instantiate prints out the following DEBUG message to the logs:

        Using (String, String, Boolean) constructor\n

        In case of NoSuchMethodException, instantiate prints out the following DEBUG message to the logs:

        Falling back to (String, String) constructor\n

        instantiate tries to find a constructor method that takes two arguments (two of type String) for the given jobId and outputPath.

        With two String arguments, instantiate requires that the given dynamicPartitionOverwrite flag is disabled (false) or throws an IllegalArgumentException:

        requirement failed: Dynamic Partition Overwrite is enabled but the committer [className] does not have the appropriate constructor\n

        instantiate is used when:

        • HadoopMapRedWriteConfigUtil and HadoopMapReduceWriteConfigUtil are requested to create a HadoopMapReduceCommitProtocol committer
        • (Spark SQL) InsertIntoHadoopFsRelationCommand, InsertIntoHiveDirCommand, and InsertIntoHiveTable logical commands are executed
        • (Spark Structured Streaming) FileStreamSink is requested to write out a micro-batch data
        ","text":""},{"location":"FileCommitProtocol/#logging","title":"Logging

        Enable ALL logging level for org.apache.spark.internal.io.FileCommitProtocol logger to see what happens inside.

        Add the following line to conf/log4j.properties:

        log4j.logger.org.apache.spark.internal.io.FileCommitProtocol=ALL\n

        Refer to Logging.

        ","text":""},{"location":"HadoopMapRedCommitProtocol/","title":"HadoopMapRedCommitProtocol","text":"

        HadoopMapRedCommitProtocol is...FIXME

        "},{"location":"HadoopMapRedWriteConfigUtil/","title":"HadoopMapRedWriteConfigUtil","text":"

        HadoopMapRedWriteConfigUtil is a HadoopWriteConfigUtil for RDD.saveAsHadoopDataset operator.

        "},{"location":"HadoopMapRedWriteConfigUtil/#creating-instance","title":"Creating Instance","text":"

        HadoopMapRedWriteConfigUtil takes the following to be created:

        • SerializableJobConf

          HadoopMapRedWriteConfigUtil is created when:

          • PairRDDFunctions is requested to saveAsHadoopDataset
          "},{"location":"HadoopMapRedWriteConfigUtil/#logging","title":"Logging","text":"

          Enable ALL logging level for org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil logger to see what happens inside.

          Add the following line to conf/log4j2.properties:

          logger.HadoopMapRedWriteConfigUtil.name = org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil\nlogger.HadoopMapRedWriteConfigUtil.level = all\n

          Refer to Logging.

          "},{"location":"HadoopMapReduceCommitProtocol/","title":"HadoopMapReduceCommitProtocol","text":"

          HadoopMapReduceCommitProtocol is a FileCommitProtocol.

          HadoopMapReduceCommitProtocol is a Serializable (Java) (to be sent out in tasks over the wire to executors).

          "},{"location":"HadoopMapReduceCommitProtocol/#creating-instance","title":"Creating Instance","text":"

          HadoopMapReduceCommitProtocol takes the following to be created:

          • Job ID
          • Path
          • dynamicPartitionOverwrite flag (default: false)

            HadoopMapReduceCommitProtocol is created when:

            • HadoopWriteConfigUtil is requested to create a committer
            • HadoopMapReduceWriteConfigUtil is requested to create a committer
            • HadoopMapRedWriteConfigUtil is requested to create a committer
            "},{"location":"HadoopMapReduceCommitProtocol/#logging","title":"Logging","text":"

            Enable ALL logging level for org.apache.spark.internal.io.HadoopMapReduceCommitProtocol logger to see what happens inside.

            Add the following line to conf/log4j2.properties:

            logger.HadoopMapReduceCommitProtocol.name = org.apache.spark.internal.io.HadoopMapReduceCommitProtocol\nlogger.HadoopMapReduceCommitProtocol.level = all\n

            Refer to Logging.

            "},{"location":"HadoopMapReduceWriteConfigUtil/","title":"HadoopMapReduceWriteConfigUtil","text":"

            HadoopMapReduceWriteConfigUtil is a HadoopWriteConfigUtil for RDD.saveAsNewAPIHadoopDataset operator.

            "},{"location":"HadoopMapReduceWriteConfigUtil/#creating-instance","title":"Creating Instance","text":"

            HadoopMapReduceWriteConfigUtil takes the following to be created:

            • SerializableConfiguration

              HadoopMapReduceWriteConfigUtil is created when:

              • PairRDDFunctions is requested to saveAsNewAPIHadoopDataset
              "},{"location":"HadoopMapReduceWriteConfigUtil/#logging","title":"Logging","text":"

              Enable ALL logging level for org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil logger to see what happens inside.

              Add the following line to conf/log4j2.properties:

              logger.HadoopMapReduceWriteConfigUtil.name = org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil\nlogger.HadoopMapReduceWriteConfigUtil.level = all\n

              Refer to Logging.

              "},{"location":"HadoopWriteConfigUtil/","title":"HadoopWriteConfigUtil","text":"

              HadoopWriteConfigUtil[K, V] is an abstraction of writer configurers for SparkHadoopWriter to write a key-value RDD (for RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset operators).

              "},{"location":"HadoopWriteConfigUtil/#contract","title":"Contract","text":""},{"location":"HadoopWriteConfigUtil/#assertconf","title":"assertConf
              assertConf(\n  jobContext: JobContext,\n  conf: SparkConf): Unit\n
              ","text":""},{"location":"HadoopWriteConfigUtil/#closewriter","title":"closeWriter
              closeWriter(\n  taskContext: TaskAttemptContext): Unit\n
              ","text":""},{"location":"HadoopWriteConfigUtil/#createcommitter","title":"createCommitter
              createCommitter(\n  jobId: Int): HadoopMapReduceCommitProtocol\n

              Creates a HadoopMapReduceCommitProtocol committer

              Used when:

              • SparkHadoopWriter is requested to write data out
              ","text":""},{"location":"HadoopWriteConfigUtil/#createjobcontext","title":"createJobContext
              createJobContext(\n  jobTrackerId: String,\n  jobId: Int): JobContext\n
              ","text":""},{"location":"HadoopWriteConfigUtil/#createtaskattemptcontext","title":"createTaskAttemptContext
              createTaskAttemptContext(\n  jobTrackerId: String,\n  jobId: Int,\n  splitId: Int,\n  taskAttemptId: Int): TaskAttemptContext\n

              Creates a Hadoop TaskAttemptContext

              ","text":""},{"location":"HadoopWriteConfigUtil/#initoutputformat","title":"initOutputFormat
              initOutputFormat(\n  jobContext: JobContext): Unit\n
              ","text":""},{"location":"HadoopWriteConfigUtil/#initwriter","title":"initWriter
              initWriter(\n  taskContext: TaskAttemptContext,\n  splitId: Int): Unit\n
              ","text":""},{"location":"HadoopWriteConfigUtil/#write","title":"write
              write(\n  pair: (K, V)): Unit\n

              Writes out the key-value pair

              Used when:

              • SparkHadoopWriter is requested to executeTask
              ","text":""},{"location":"HadoopWriteConfigUtil/#implementations","title":"Implementations","text":"
              • HadoopMapReduceWriteConfigUtil
              • HadoopMapRedWriteConfigUtil
              "},{"location":"HeartbeatReceiver/","title":"HeartbeatReceiver RPC Endpoint","text":"

              HeartbeatReceiver is a ThreadSafeRpcEndpoint that is registered on the driver as HeartbeatReceiver.

              HeartbeatReceiver receives Heartbeat messages from executors for accumulator updates (with task metrics and a Spark application's accumulators) and passes them along to TaskScheduler.

              HeartbeatReceiver is registered immediately after a Spark application is started (i.e. when SparkContext is created).

              HeartbeatReceiver is a SparkListener to get notified about new executors or executors that are no longer available.

              "},{"location":"HeartbeatReceiver/#creating-instance","title":"Creating Instance","text":"

              HeartbeatReceiver takes the following to be created:

              • SparkContext
              • Clock (default: SystemClock)

                HeartbeatReceiver is created\u00a0when SparkContext is created

                "},{"location":"HeartbeatReceiver/#taskscheduler","title":"TaskScheduler

                HeartbeatReceiver manages a reference to TaskScheduler.

                ","text":""},{"location":"HeartbeatReceiver/#rpc-messages","title":"RPC Messages","text":""},{"location":"HeartbeatReceiver/#executorremoved","title":"ExecutorRemoved

                Attributes:

                • Executor ID

                Posted when HeartbeatReceiver is notified that an executor is no longer available

                When received, HeartbeatReceiver removes the executor (from executorLastSeen internal registry).

                ","text":""},{"location":"HeartbeatReceiver/#executorregistered","title":"ExecutorRegistered

                Attributes:

                • Executor ID

                Posted when HeartbeatReceiver is notified that a new executor has been registered

                When received, HeartbeatReceiver registers the executor and the current time (in executorLastSeen internal registry).

                ","text":""},{"location":"HeartbeatReceiver/#expiredeadhosts","title":"ExpireDeadHosts

                No attributes

                When received, HeartbeatReceiver prints out the following TRACE message to the logs:

                Checking for hosts with no recent heartbeats in HeartbeatReceiver.\n

                Each executor (in executorLastSeen internal registry) is checked whether the time it was last seen is not past spark.network.timeout.

                For any such executor, HeartbeatReceiver prints out the following WARN message to the logs:

                Removing executor [executorId] with no recent heartbeats: [time] ms exceeds timeout [timeout] ms\n

                 HeartbeatReceiver notifies the TaskScheduler that the executor was lost (TaskScheduler.executorLost with SlaveLost(\"Executor heartbeat timed out after [timeout] ms\")).

                SparkContext.killAndReplaceExecutor is asynchronously called for the executor (i.e. on killExecutorThread).

                The executor is removed from the executorLastSeen internal registry.

                ","text":""},{"location":"HeartbeatReceiver/#heartbeat","title":"Heartbeat

                Attributes:

                • Executor ID
                • AccumulatorV2 updates (by task ID)
                • BlockManagerId
                • ExecutorMetrics peaks (by stage and stage attempt IDs)

                Posted when Executor informs that it is alive and reports task metrics.

                When received, HeartbeatReceiver finds the executorId executor (in executorLastSeen internal registry).

                When the executor is found, HeartbeatReceiver updates the time the heartbeat was received (in executorLastSeen internal registry).

                HeartbeatReceiver uses the Clock to know the current time.

                HeartbeatReceiver then submits an asynchronous task to notify TaskScheduler that the heartbeat was received from the executor (using TaskScheduler internal reference). HeartbeatReceiver posts a HeartbeatResponse back to the executor (with the response from TaskScheduler whether the executor has been registered already or not so it may eventually need to re-register).

                If however the executor was not found (in executorLastSeen internal registry), i.e. the executor was not registered before, you should see the following DEBUG message in the logs and the response is to notify the executor to re-register.

                Received heartbeat from unknown executor [executorId]\n

                In a very rare case, when TaskScheduler is not yet assigned to HeartbeatReceiver, you should see the following WARN message in the logs and the response is to notify the executor to re-register.

                Dropping [heartbeat] because TaskScheduler is not ready yet\n
                ","text":""},{"location":"HeartbeatReceiver/#taskschedulerisset","title":"TaskSchedulerIsSet

                No attributes

                Posted when SparkContext informs that TaskScheduler is available.

                When received, HeartbeatReceiver sets the internal reference to TaskScheduler.

                ","text":""},{"location":"HeartbeatReceiver/#onexecutoradded","title":"onExecutorAdded
                onExecutorAdded(\n  executorAdded: SparkListenerExecutorAdded): Unit\n

                onExecutorAdded sends an ExecutorRegistered message to itself.

                onExecutorAdded\u00a0is part of the SparkListener abstraction.

                ","text":""},{"location":"HeartbeatReceiver/#addexecutor","title":"addExecutor
                addExecutor(\n  executorId: String): Option[Future[Boolean]]\n

                addExecutor...FIXME

                ","text":""},{"location":"HeartbeatReceiver/#onexecutorremoved","title":"onExecutorRemoved
                onExecutorRemoved(\n  executorRemoved: SparkListenerExecutorRemoved): Unit\n

                onExecutorRemoved removes the executor.

                onExecutorRemoved\u00a0is part of the SparkListener abstraction.

                ","text":""},{"location":"HeartbeatReceiver/#removeexecutor","title":"removeExecutor
                removeExecutor(\n  executorId: String): Option[Future[Boolean]]\n

                removeExecutor...FIXME

                ","text":""},{"location":"HeartbeatReceiver/#starting-heartbeatreceiver","title":"Starting HeartbeatReceiver
                onStart(): Unit\n

                onStart sends a blocking ExpireDeadHosts every spark.network.timeoutInterval on eventLoopThread.

                onStart\u00a0is part of the RpcEndpoint abstraction.

                ","text":""},{"location":"HeartbeatReceiver/#stopping-heartbeatreceiver","title":"Stopping HeartbeatReceiver
                onStop(): Unit\n

                onStop shuts down the eventLoopThread and killExecutorThread thread pools.

                onStop\u00a0is part of the RpcEndpoint abstraction.

                ","text":""},{"location":"HeartbeatReceiver/#handling-two-way-messages","title":"Handling Two-Way Messages
                receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                receiveAndReply...FIXME

                receiveAndReply\u00a0is part of the RpcEndpoint abstraction.

                ","text":""},{"location":"HeartbeatReceiver/#thread-pools","title":"Thread Pools","text":""},{"location":"HeartbeatReceiver/#kill-executor-thread","title":"kill-executor-thread

                killExecutorThread is a daemon ScheduledThreadPoolExecutor with a single thread.

                The name of the thread pool is kill-executor-thread.

                ","text":""},{"location":"HeartbeatReceiver/#heartbeat-receiver-event-loop-thread","title":"heartbeat-receiver-event-loop-thread

                eventLoopThread is a daemon ScheduledThreadPoolExecutor with a single thread.

                The name of the thread pool is heartbeat-receiver-event-loop-thread.

                ","text":""},{"location":"HeartbeatReceiver/#expiring-dead-hosts","title":"Expiring Dead Hosts
                expireDeadHosts(): Unit\n

                expireDeadHosts...FIXME

                 expireDeadHosts\u00a0is used when HeartbeatReceiver is requested to receive an ExpireDeadHosts message.

                ","text":""},{"location":"HeartbeatReceiver/#logging","title":"Logging

                Enable ALL logging level for org.apache.spark.HeartbeatReceiver logger to see what happens inside.

                Add the following line to conf/log4j.properties:

                log4j.logger.org.apache.spark.HeartbeatReceiver=ALL\n

                Refer to Logging.

                ","text":""},{"location":"InterruptibleIterator/","title":"InterruptibleIterator","text":"

                == [[InterruptibleIterator]] InterruptibleIterator -- Iterator With Support For Task Cancellation

                 InterruptibleIterator is a custom Scala https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator[Iterator] that supports task cancellation, i.e. stopping early when the owning task is killed.

                Quoting the official Scala https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator[Iterator] documentation:

                Iterators are data structures that allow to iterate over a sequence of elements. They have a hasNext method for checking if there is a next element available, and a next method which returns the next element and discards it from the iterator.

                InterruptibleIterator is <> when:

                • RDD is requested to rdd:RDD.md#getOrCompute[get or compute a RDD partition]

                • CoGroupedRDD, rdd:HadoopRDD.md#compute[HadoopRDD], rdd:NewHadoopRDD.md#compute[NewHadoopRDD], rdd:ParallelCollectionRDD.md#compute[ParallelCollectionRDD] are requested to compute a partition

                • BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined key-value records for a reduce task]

                • PairRDDFunctions is requested to rdd:PairRDDFunctions.md#combineByKeyWithClassTag[combineByKeyWithClassTag]

                • Spark SQL's DataSourceRDD and JDBCRDD are requested to compute a partition

                • Spark SQL's RangeExec physical operator is requested to doExecute

                • PySpark's BasePythonRunner is requested to compute

                [[creating-instance]] InterruptibleIterator takes the following when created:

                • [[context]] TaskContext
                • [[delegate]] Scala Iterator[T]

                NOTE: InterruptibleIterator is a Developer API which is a lower-level, unstable API intended for Spark developers that may change or be removed in minor versions of Apache Spark.

                === [[hasNext]] hasNext Method

                "},{"location":"InterruptibleIterator/#source-scala","title":"[source, scala]","text":""},{"location":"InterruptibleIterator/#hasnext-boolean","title":"hasNext: Boolean","text":"

                NOTE: hasNext is part of ++https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@hasNext:Boolean++[Iterator Contract] to test whether this iterator can provide another element.

                 hasNext requests the TaskContext to kill the task if interrupted (that simply throws a TaskKilledException that in turn breaks the task execution).

                 In the end, hasNext requests the delegate Iterator to hasNext.

                === [[next]] next Method

                "},{"location":"InterruptibleIterator/#source-scala_1","title":"[source, scala]","text":""},{"location":"InterruptibleIterator/#next-t","title":"next(): T","text":"

                NOTE: next is part of ++https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@next():A++[Iterator Contract] to produce the next element of this iterator.

                 next simply requests the delegate Iterator to next."},{"location":"ListenerBus/","title":"ListenerBus","text":"

                ListenerBus is an abstraction of event buses that can notify listeners about scheduling events.

                "},{"location":"ListenerBus/#contract","title":"Contract","text":""},{"location":"ListenerBus/#notifying-listener-about-event","title":"Notifying Listener about Event
                doPostEvent(\n  listener: L,\n  event: E): Unit\n

                Used when ListenerBus is requested to postToAll

                ","text":""},{"location":"ListenerBus/#implementations","title":"Implementations","text":"
                • ExecutionListenerBus
                • ExternalCatalogWithListener
                • SparkListenerBus
                • StreamingListenerBus
                • StreamingQueryListenerBus
                "},{"location":"ListenerBus/#posting-event-to-all-listeners","title":"Posting Event To All Listeners
                postToAll(\n  event: E): Unit\n

                postToAll...FIXME

                postToAll\u00a0is used when:

                • AsyncEventQueue is requested to dispatch an event
                • ReplayListenerBus is requested to replay events
                ","text":""},{"location":"ListenerBus/#registering-listener","title":"Registering Listener
                addListener(\n  listener: L): Unit\n

                addListener...FIXME

                addListener\u00a0is used when:

                • LiveListenerBus is requested to addToQueue
                • EventLogFileCompactor is requested to initializeBuilders
                • FsHistoryProvider is requested to doMergeApplicationListing and rebuildAppStore
                ","text":""},{"location":"OutputCommitCoordinator/","title":"OutputCommitCoordinator","text":"

                From the scaladoc (it's a private[spark] class so no way to find it outside the code):

                Authority that decides whether tasks can commit output to HDFS. Uses a \"first committer wins\" policy.

                OutputCommitCoordinator is instantiated in both the drivers and executors. On executors, it is configured with a reference to the driver's OutputCommitCoordinatorEndpoint, so requests to commit output will be forwarded to the driver's OutputCommitCoordinator.

                This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull requests) for an extensive design discussion.

                "},{"location":"OutputCommitCoordinator/#creating-instance","title":"Creating Instance","text":"

                OutputCommitCoordinator takes the following to be created:

                • SparkConf
                • isDriver flag

                  OutputCommitCoordinator is created\u00a0when:

                  • SparkEnv utility is used to create a SparkEnv on the driver
                  "},{"location":"OutputCommitCoordinator/#outputcommitcoordinator-rpc-endpoint","title":"OutputCommitCoordinator RPC Endpoint
                  coordinatorRef: Option[RpcEndpointRef]\n

                  OutputCommitCoordinator is registered as OutputCommitCoordinator (with OutputCommitCoordinatorEndpoint RPC Endpoint) in the RPC Environment on the driver (when SparkEnv utility is used to create \"base\" SparkEnv). Executors have an RpcEndpointRef to the endpoint on the driver.

                  coordinatorRef is used to post an AskPermissionToCommitOutput (by executors) to the OutputCommitCoordinator (when canCommit).

                  coordinatorRef is used to stop the OutputCommitCoordinator on the driver (when stop).

                  ","text":""},{"location":"OutputCommitCoordinator/#cancommit","title":"canCommit
                  canCommit(\n  stage: Int,\n  stageAttempt: Int,\n  partition: Int,\n  attemptNumber: Int): Boolean\n

                  canCommit creates a AskPermissionToCommitOutput message and sends it (asynchronously) to the OutputCommitCoordinator RPC Endpoint.

                  canCommit\u00a0is used when:

                  • SparkHadoopMapRedUtil is requested to commitTask (with spark.hadoop.outputCommitCoordination.enabled configuration property enabled)
                  • DataWritingSparkTask (Spark SQL) utility is used to run
                  ","text":""},{"location":"OutputCommitCoordinator/#handleaskpermissiontocommit","title":"handleAskPermissionToCommit
                  handleAskPermissionToCommit(\n  stage: Int,\n  stageAttempt: Int,\n  partition: Int,\n  attemptNumber: Int): Boolean\n

                  handleAskPermissionToCommit...FIXME

                  handleAskPermissionToCommit\u00a0is used when:

                  • OutputCommitCoordinatorEndpoint is requested to handle a AskPermissionToCommitOutput message (that happens after it was sent out in canCommit)
                  ","text":""},{"location":"OutputCommitCoordinator/#logging","title":"Logging

                  Enable ALL logging level for org.apache.spark.scheduler.OutputCommitCoordinator logger to see what happens inside.

                  Add the following line to conf/log4j.properties:

                  log4j.logger.org.apache.spark.scheduler.OutputCommitCoordinator=ALL\n

                  Refer to Logging.

                  ","text":""},{"location":"SparkConf/","title":"SparkConf","text":"

                  SparkConf is Serializable (Java).

                  "},{"location":"SparkConf/#creating-instance","title":"Creating Instance","text":"

                  SparkConf takes the following to be created:

                  • loadDefaults flag
                  "},{"location":"SparkConf/#loaddefaults-flag","title":"loadDefaults Flag

                  SparkConf can be given loadDefaults flag when created.

                  Default: true

                  When true, SparkConf loads spark properties (with silent flag disabled) when created.

                  ","text":""},{"location":"SparkConf/#getallwithprefix","title":"getAllWithPrefix
                  getAllWithPrefix(\n  prefix: String): Array[(String, String)]\n

                  getAllWithPrefix collects the key-value pairs (from getAll) whose keys start with the given prefix.

                  In the end, getAllWithPrefix removes the given prefix from the keys.

                  getAllWithPrefix is used when:

                  • SparkConf is requested to getExecutorEnv (spark.executorEnv. prefix), fillMissingMagicCommitterConfsIfNeeded (spark.hadoop.fs.s3a.bucket. prefix)
                  • ExecutorPluginContainer is requested for the executorPlugins (spark.plugins.internal.conf. prefix)
                  • ResourceUtils is requested to parseResourceRequest, listResourceIds, addTaskResourceRequests, parseResourceRequirements
                  • SortShuffleManager is requested to loadShuffleExecutorComponents (spark.shuffle.plugin.__config__. prefix)
                  • ServerInfo is requested to addFilters
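
                  A short demo of getAllWithPrefix (the property values are made up for illustration):

                  import org.apache.spark.SparkConf\nval conf = new SparkConf(loadDefaults = false)\n  .set(\"spark.executorEnv.JAVA_HOME\", \"/opt/java\")\n  .set(\"spark.executorEnv.PATH\", \"/usr/bin\")\n// the prefix is removed from the returned keys\nassert(conf.getAllWithPrefix(\"spark.executorEnv.\").toMap ==\n  Map(\"JAVA_HOME\" -> \"/opt/java\", \"PATH\" -> \"/usr/bin\"))\n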
                  ","text":""},{"location":"SparkConf/#loading-spark-properties","title":"Loading Spark Properties
                  loadFromSystemProperties(\n  silent: Boolean): SparkConf\n

                  loadFromSystemProperties records all the spark.-prefixed system properties in this SparkConf.

                  Silently loading system properties

                  Loading system properties silently is possible using the following:

                  new SparkConf(loadDefaults = false).loadFromSystemProperties(silent = true)\n

                  loadFromSystemProperties is used when:

                  • SparkConf is created (with loadDefaults enabled)
                  • SparkHadoopUtil is created
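
                  For comparison, the default constructor (with loadDefaults enabled) picks up spark.-prefixed system properties automatically (the property value below is made up):

                  System.setProperty(\"spark.app.name\", \"Demo App\")\nval conf = new org.apache.spark.SparkConf() // loadDefaults = true\nassert(conf.get(\"spark.app.name\") == \"Demo App\")\n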
                  ","text":""},{"location":"SparkConf/#executor-settings","title":"Executor Settings

                  SparkConf uses spark.executorEnv. prefix for executor settings.

                  ","text":""},{"location":"SparkConf/#getexecutorenv","title":"getExecutorEnv
                  getExecutorEnv: Seq[(String, String)]\n

                  getExecutorEnv gets all the settings with spark.executorEnv. prefix.

                  getExecutorEnv is used when:

                  • SparkContext is created (and requested for executorEnvs)
                  ","text":""},{"location":"SparkConf/#setexecutorenv","title":"setExecutorEnv
                  setExecutorEnv(\n  variables: Array[(String, String)]): SparkConf\nsetExecutorEnv(\n  variables: Seq[(String, String)]): SparkConf\nsetExecutorEnv(\n  variable: String, value: String): SparkConf\n

                  setExecutorEnv sets the given (key-value) variables with the spark.executorEnv. prefix added to their keys.

                  setExecutorEnv is used when:

                  • SparkContext is requested to updatedConf
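
                  A quick demo of the executor-settings helpers (the variable and value are made up):

                  import org.apache.spark.SparkConf\nval conf = new SparkConf(loadDefaults = false)\n  .setExecutorEnv(\"JAVA_HOME\", \"/opt/java\")\nassert(conf.get(\"spark.executorEnv.JAVA_HOME\") == \"/opt/java\")\nassert(conf.getExecutorEnv.contains(\"JAVA_HOME\" -> \"/opt/java\"))\n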
                  ","text":""},{"location":"SparkConf/#logging","title":"Logging

                  Enable ALL logging level for org.apache.spark.SparkConf logger to see what happens inside.

                  Add the following line to conf/log4j.properties:

                  log4j.logger.org.apache.spark.SparkConf=ALL\n

                  Refer to Logging.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/","title":"Inside Creating SparkContext","text":"

                  This document describes the internals of what happens when a new SparkContext is created.

                  import org.apache.spark.{SparkConf, SparkContext}\n\n// 1. Create Spark configuration\nval conf = new SparkConf()\n  .setAppName(\"SparkMe Application\")\n  .setMaster(\"local[*]\")\n\n// 2. Create Spark context\nval sc = new SparkContext(conf)\n
                  "},{"location":"SparkContext-creating-instance-internals/#creationsite","title":"creationSite
                  creationSite: CallSite\n

                  SparkContext determines the call site.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#assertondriver","title":"assertOnDriver

                  SparkContext...FIXME

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#markpartiallyconstructed","title":"markPartiallyConstructed

                  SparkContext...FIXME

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#starttime","title":"startTime
                  startTime: Long\n

                  SparkContext records the current time (in ms).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#stopped","title":"stopped
                  stopped: AtomicBoolean\n

                  SparkContext initializes stopped flag to false.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#printing-out-spark-version","title":"Printing Out Spark Version

                  SparkContext prints out the following INFO message to the logs:

                  Running Spark version [SPARK_VERSION]\n
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkuser","title":"sparkUser
                  sparkUser: String\n

                  SparkContext determines Spark user.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkconf","title":"SparkConf
                  _conf: SparkConf\n

                  SparkContext clones the SparkConf and requests it to validateSettings.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#enforcing-mandatory-configuration-properties","title":"Enforcing Mandatory Configuration Properties

                  SparkContext asserts that spark.master and spark.app.name are defined (in the SparkConf).

                  A master URL must be set in your configuration\n
                  An application name must be set in your configuration\n
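
                  For example, creating a SparkContext with neither setting defined fails fast with a SparkException carrying the first message (a sketch, not something you would normally write):

                  import org.apache.spark.{SparkConf, SparkContext}\n// throws SparkException: A master URL must be set in your configuration\nnew SparkContext(new SparkConf(loadDefaults = false))\n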
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#driverlogger","title":"DriverLogger
                  _driverLogger: Option[DriverLogger]\n

                  SparkContext creates a DriverLogger.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#resourceinformation","title":"ResourceInformation
                  _resources: Map[String, ResourceInformation]\n

                  SparkContext uses spark.driver.resourcesFile configuration property to discover driver resources and prints out the following INFO message to the logs:

                  ==============================================================\nResources for [componentName]:\n[resources]\n==============================================================\n
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#submitted-application","title":"Submitted Application

                  SparkContext prints out the following INFO message to the logs (with the value of spark.app.name configuration property):

                  Submitted application: [appName]\n
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#spark-on-yarn-and-sparkyarnappid","title":"Spark on YARN and spark.yarn.app.id

                  For Spark on YARN in cluster deploy mode, SparkContext checks whether the spark.yarn.app.id configuration property is defined. A SparkException is thrown if it is not.

                  Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.\n
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#displaying-spark-configuration","title":"Displaying Spark Configuration

                  With spark.logConf configuration property enabled, SparkContext prints out the following INFO message to the logs:

                  Spark configuration:\n[conf.toDebugString]\n

                  Note

                  SparkConf.toDebugString is used very early in the initialization process and other settings configured afterwards are not included. Use SparkContext.getConf.toDebugString once SparkContext is initialized.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-configuration-properties","title":"Setting Configuration Properties
                  • spark.driver.host to the current value of the property (to override the default)
                  • spark.driver.port to 0 unless defined already
                  • spark.executor.id to driver
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#user-defined-jar-files","title":"User-Defined Jar Files
                  _jars: Seq[String]\n

                  SparkContext sets the _jars to spark.jars configuration property.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#user-defined-files","title":"User-Defined Files
                  _files: Seq[String]\n

                  SparkContext sets the _files to spark.files configuration property.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkeventlogdir","title":"spark.eventLog.dir
                  _eventLogDir: Option[URI]\n

                  If event logging is enabled (the spark.eventLog.enabled flag is true), the internal field _eventLogDir is set to the value of the spark.eventLog.dir setting or the default value /tmp/spark-events.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkeventlogcompress","title":"spark.eventLog.compress
                  _eventLogCodec: Option[String]\n

                  Also, if spark.eventLog.compress is enabled (it is not by default), the short name of the CompressionCodec is assigned to _eventLogCodec. The config key is spark.io.compression.codec (default: lz4).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-livelistenerbus","title":"Creating LiveListenerBus
                  _listenerBus: LiveListenerBus\n

                  SparkContext creates a LiveListenerBus.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-appstatusstore-and-appstatussource","title":"Creating AppStatusStore (and AppStatusSource)
                  _statusStore: AppStatusStore\n

                  SparkContext creates an in-memory store (with an optional AppStatusSource if enabled) and requests the LiveListenerBus to register the AppStatusListener with the status queue.

                  The AppStatusStore is available using the statusStore property of the SparkContext.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkenv","title":"Creating SparkEnv
                  _env: SparkEnv\n

                  SparkContext creates a SparkEnv and requests SparkEnv to use the instance as the default SparkEnv.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkreplclassuri","title":"spark.repl.class.uri

                  With spark.repl.class.outputDir configuration property defined, SparkContext sets spark.repl.class.uri configuration property to be...FIXME

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkstatustracker","title":"Creating SparkStatusTracker
                  _statusTracker: SparkStatusTracker\n

                  SparkContext creates a SparkStatusTracker (with itself and the AppStatusStore).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-consoleprogressbar","title":"Creating ConsoleProgressBar
                  _progressBar: Option[ConsoleProgressBar]\n

                  SparkContext creates a ConsoleProgressBar only when spark.ui.showConsoleProgress configuration property is enabled.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkui","title":"Creating SparkUI
                  _ui: Option[SparkUI]\n

                  SparkContext creates a SparkUI only when spark.ui.enabled configuration property is enabled.

                  SparkContext requests the SparkUI to bind.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#hadoop-configuration","title":"Hadoop Configuration
                  _hadoopConfiguration: Configuration\n

                  SparkContext creates a new Hadoop Configuration.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-user-defined-jar-files","title":"Adding User-Defined Jar Files

                  If there are jars given through the SparkContext constructor, they are added using addJar.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-user-defined-files","title":"Adding User-Defined Files

                  SparkContext adds the files in spark.files configuration property.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#_executormemory","title":"_executorMemory
                  _executorMemory: Int\n

                  SparkContext determines the amount of memory to allocate to each executor. It is the value of the spark.executor.memory setting, or the SPARK_EXECUTOR_MEMORY environment variable (or the currently-deprecated SPARK_MEM), or defaults to 1024.

                  _executorMemory is later available as sc.executorMemory and is used for LOCAL_CLUSTER_REGEX, SparkDeploySchedulerBackend, MesosSchedulerBackend and CoarseMesosSchedulerBackend, and to set executorEnvs(\"SPARK_EXECUTOR_MEMORY\").

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#spark_prepend_classes-environment-variable","title":"SPARK_PREPEND_CLASSES Environment Variable

                  The value of SPARK_PREPEND_CLASSES environment variable is included in executorEnvs.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#for-mesos-schedulerbackend-only","title":"For Mesos SchedulerBackend Only

                  The Mesos scheduler backend's configuration is included in executorEnvs, i.e. SPARK_EXECUTOR_MEMORY, _conf.getExecutorEnv, and SPARK_USER.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#shuffledrivercomponents","title":"ShuffleDriverComponents
                  _shuffleDriverComponents: ShuffleDriverComponents\n

                  SparkContext...FIXME

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-heartbeatreceiver","title":"Registering HeartbeatReceiver

                  SparkContext registers HeartbeatReceiver RPC endpoint.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#plugincontainer","title":"PluginContainer
                  _plugins: Option[PluginContainer]\n

                  SparkContext creates a PluginContainer (with itself and the _resources).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-schedulerbackend-and-taskscheduler","title":"Creating SchedulerBackend and TaskScheduler

                  SparkContext object is requested to create the SchedulerBackend with the TaskScheduler (for the given master URL) and the result becomes the internal _schedulerBackend and _taskScheduler.

                  DAGScheduler is created (as _dagScheduler).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sending-blocking-taskschedulerisset","title":"Sending Blocking TaskSchedulerIsSet

                  SparkContext sends a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC endpoint (to inform that the TaskScheduler is now available).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#executormetricssource","title":"ExecutorMetricsSource

                  SparkContext creates an ExecutorMetricsSource when the spark.metrics.executorMetricsSource.enabled configuration property is enabled.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#heartbeater","title":"Heartbeater

                  SparkContext creates a Heartbeater and starts it.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-taskscheduler","title":"Starting TaskScheduler

                  SparkContext requests the TaskScheduler to start.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-spark-applications-and-execution-attempts-ids","title":"Setting Spark Application's and Execution Attempt's IDs

                  SparkContext sets the internal fields -- _applicationId and _applicationAttemptId -- (using the applicationId and applicationAttemptId methods of the TaskScheduler Contract).

                  NOTE: SparkContext requests the TaskScheduler for the unique identifier of a Spark application (that is currently only implemented by TaskSchedulerImpl, which uses the SchedulerBackend to request the identifier).

                  NOTE: The unique identifier of a Spark application is used to initialize the SparkUI and the BlockManager.

                  NOTE: _applicationAttemptId is used when SparkContext is requested for the unique identifier of an execution attempt of a Spark application and when EventLoggingListener is created.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-sparkappid-spark-property-in-sparkconf","title":"Setting spark.app.id Spark Property in SparkConf

                  SparkContext sets the spark.app.id property to the unique identifier of a Spark application (_applicationId) and, if enabled, passes it on to the SparkUI.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkuiproxybase","title":"spark.ui.proxyBase","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-sparkui","title":"Initializing SparkUI

                  SparkContext requests the SparkUI (if defined) to setAppId with the _applicationId.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-blockmanager","title":"Initializing BlockManager

                  The BlockManager (for the driver) is initialized (with _applicationId).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-metricssystem","title":"Starting MetricsSystem

                  SparkContext requests the MetricsSystem to start (with the value of the spark.metrics.staticSources.enabled configuration property).

                  Note

                  SparkContext starts the MetricsSystem after setting spark.app.id as MetricsSystem uses it to build unique identifiers for metrics sources.","text":""},{"location":"SparkContext-creating-instance-internals/#attaching-servlet-handlers-to-web-ui","title":"Attaching Servlet Handlers to web UI

                  SparkContext requests the MetricsSystem for servlet handlers and requests the SparkUI to attach them.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-eventlogginglistener-with-event-log-enabled","title":"Starting EventLoggingListener (with Event Log Enabled)
                  _eventLogger: Option[EventLoggingListener]\n

                  With spark.eventLog.enabled configuration property enabled, SparkContext creates an EventLoggingListener and requests it to start.

                  SparkContext requests the LiveListenerBus to add the EventLoggingListener to eventLog event queue.

                  With spark.eventLog.enabled disabled, _eventLogger is None (undefined).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#contextcleaner","title":"ContextCleaner
                  _cleaner: Option[ContextCleaner]\n

                  With spark.cleaner.referenceTracking configuration property enabled, SparkContext creates a ContextCleaner (with itself and the _shuffleDriverComponents).

                  SparkContext requests the ContextCleaner to start.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#executorallocationmanager","title":"ExecutorAllocationManager
                  _executorAllocationManager: Option[ExecutorAllocationManager]\n

                  SparkContext initializes _executorAllocationManager internal registry.

                  SparkContext creates an ExecutorAllocationManager when:

                  • Dynamic Allocation of Executors is enabled (based on spark.dynamicAllocation.enabled configuration property and the master URL)

                  • SchedulerBackend is an ExecutorAllocationClient

                  The ExecutorAllocationManager is requested to start.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-user-defined-sparklisteners","title":"Registering User-Defined SparkListeners

                  SparkContext registers user-defined listeners and starts SparkListenerEvent event delivery to the listeners.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#postenvironmentupdate","title":"postEnvironmentUpdate

                  postEnvironmentUpdate is called to post a SparkListenerEnvironmentUpdate message on the LiveListenerBus with information about the Task Scheduler's scheduling mode, added jar and file paths, and other environmental details.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#postapplicationstart","title":"postApplicationStart

                  A SparkListenerApplicationStart message is posted to the LiveListenerBus (using the internal postApplicationStart method).

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#poststarthook","title":"postStartHook

                  TaskScheduler is notified that SparkContext is almost fully initialized (using postStartHook).

                  NOTE: TaskScheduler.postStartHook does nothing by default, but custom implementations offer more advanced features, e.g. TaskSchedulerImpl blocks the current thread until the SchedulerBackend is ready. There is also YarnClusterScheduler for Spark on YARN in cluster deploy mode.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-metrics-sources","title":"Registering Metrics Sources

                  SparkContext requests MetricsSystem to register metrics sources for the following services:

                  • DAGScheduler
                  • BlockManager
                  • ExecutorAllocationManager
                  ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-shutdown-hook","title":"Adding Shutdown Hook

                  SparkContext adds a shutdown hook (using ShutdownHookManager.addShutdownHook()).

                  SparkContext prints out the following DEBUG message to the logs:

                  Adding shutdown hook\n

                  CAUTION: FIXME ShutdownHookManager.addShutdownHook()

                  Any non-fatal Exception leads to termination of the Spark context instance.

                  CAUTION: FIXME What does NonFatal represent in Scala?

                  CAUTION: FIXME Finish me

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-nextshuffleid-and-nextrddid-internal-counters","title":"Initializing nextShuffleId and nextRddId Internal Counters

                  nextShuffleId and nextRddId start with 0.

                  CAUTION: FIXME Where are nextShuffleId and nextRddId used?

                  A new instance of Spark context is created and ready for operation.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#loading-external-cluster-manager-for-url-getclustermanager-method","title":"Loading External Cluster Manager for URL (getClusterManager method)
                  getClusterManager(\n  url: String): Option[ExternalClusterManager]\n

                  getClusterManager loads the ExternalClusterManager that can handle the input url.

                  If there are two or more external cluster managers that could handle url, a SparkException is thrown:

                  Multiple Cluster Managers ([serviceLoaders]) registered for the url [url].\n

                  NOTE: getClusterManager uses Java's ServiceLoader.load method (https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html).

                  NOTE: getClusterManager is used to find a cluster manager for a master URL when creating a SchedulerBackend and a TaskScheduler for the driver.

                  ","text":""},{"location":"SparkContext-creating-instance-internals/#setupandstartlistenerbus","title":"setupAndStartListenerBus
                  setupAndStartListenerBus(): Unit\n

                  setupAndStartListenerBus is an internal method that reads the spark.extraListeners configuration property from the current SparkConf to create and register SparkListenerInterface listeners.

                  It expects that the class name represents a SparkListenerInterface listener with one of the following constructors (in this order):

                  • a single-argument constructor that accepts SparkConf
                  • a zero-argument constructor

                  setupAndStartListenerBus registers every listener class (using addListener).

                  You should see the following INFO message in the logs:

                  INFO Registered listener [className]\n

                  It starts the LiveListenerBus and records it in the internal _listenerBusStarted.

                  When neither a single-argument SparkConf constructor nor a zero-argument constructor could be found for a class name in the spark.extraListeners configuration property, a SparkException is thrown with the message:

                  [className] did not have a zero-argument constructor or a single-argument constructor that accepts SparkConf. Note: if the class is defined inside of another Scala class, then its constructors may accept an implicit parameter that references the enclosing class; in this case, you must define the listener as a top-level class in order to prevent this extra parameter from breaking Spark's ability to find a valid constructor.\n

                  Any exception while registering a SparkListenerInterface listener stops the SparkContext, and a SparkException is thrown with the source exception's message:

                  Exception when registering SparkListener\n

                  Tip

                  Set INFO logging level for org.apache.spark.SparkContext logger to see the extra listeners being registered.

                  Registered listener pl.japila.spark.CustomSparkListener\n
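
                  A minimal sketch of such an extra listener (the class and package names follow the example above and are illustrative). It uses the single-SparkConf-argument constructor and would be registered with --conf spark.extraListeners=pl.japila.spark.CustomSparkListener:

                  package pl.japila.spark\n\nimport org.apache.spark.SparkConf\nimport org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}\n\nclass CustomSparkListener(conf: SparkConf) extends SparkListener {\n  override def onApplicationStart(event: SparkListenerApplicationStart): Unit =\n    println(s\"Application started: ${event.appName}\")\n}\n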
                  ","text":""},{"location":"SparkContext/","title":"SparkContext","text":"

                  SparkContext is the entry point to all of the components of Apache Spark (execution engine) and so the heart of a Spark application. In fact, you can consider an application a Spark application only when it uses a SparkContext (directly or indirectly).

                  Important

                  There should be one active SparkContext per JVM and Spark developers should use SparkContext.getOrCreate utility for sharing it (e.g. across threads).

                  "},{"location":"SparkContext/#creating-instance","title":"Creating Instance","text":"

                  SparkContext takes the following to be created:

                  • SparkConf

                    SparkContext is created (directly or indirectly using getOrCreate utility).

                    While being created, SparkContext sets up core services and establishes a connection to a cluster manager.

                    "},{"location":"SparkContext/#checkpoint-directory","title":"Checkpoint Directory

                    SparkContext defines checkpointDir internal registry for the path to a checkpoint directory.

                    checkpointDir is undefined (None) when SparkContext is created and is set using setCheckpointDir.

                    checkpointDir is required for Reliable Checkpointing.

                    checkpointDir is available using getCheckpointDir.

                    ","text":""},{"location":"SparkContext/#getcheckpointdir","title":"getCheckpointDir
                    getCheckpointDir: Option[String]\n

                    getCheckpointDir returns the checkpointDir.

                    getCheckpointDir is used when:

                    • ReliableRDDCheckpointData is requested for the checkpoint path
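
                    A short demo (the directory is an assumption for illustration):

                    // assuming a fresh SparkContext sc\nassert(sc.getCheckpointDir.isEmpty)\nsc.setCheckpointDir(\"/tmp/checkpoints\")\nassert(sc.getCheckpointDir.isDefined)\n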
                    ","text":""},{"location":"SparkContext/#submitting-mapstage-for-execution","title":"Submitting MapStage for Execution
                    submitMapStage[K, V, C](\n  dependency: ShuffleDependency[K, V, C]): SimpleFutureAction[MapOutputStatistics]\n

                    submitMapStage requests the DAGScheduler to submit the given ShuffleDependency for execution (that eventually produces a MapOutputStatistics).

                    submitMapStage is used when:

                    • ShuffleExchangeExec (Spark SQL) unary physical operator is executed
                    ","text":""},{"location":"SparkContext/#executormetricssource","title":"ExecutorMetricsSource

                    SparkContext creates an ExecutorMetricsSource when created with spark.metrics.executorMetricsSource.enabled enabled.

                    SparkContext requests the ExecutorMetricsSource to register with the MetricsSystem.

                    SparkContext uses the ExecutorMetricsSource to create the Heartbeater.

                    ","text":""},{"location":"SparkContext/#services","title":"Services
                    • ExecutorAllocationManager (optional)

                    • SchedulerBackend","text":""},{"location":"SparkContext/#resourceprofilemanager","title":"ResourceProfileManager

                      SparkContext creates a ResourceProfileManager when created.

                      ","text":""},{"location":"SparkContext/#resourceprofilemanager_1","title":"resourceProfileManager
                      resourceProfileManager: ResourceProfileManager\n

                      resourceProfileManager returns the ResourceProfileManager.

                      resourceProfileManager is used when:

                      • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                      • others
                      ","text":""},{"location":"SparkContext/#driverlogger","title":"DriverLogger

                      SparkContext can create a DriverLogger when created.

                      SparkContext requests the DriverLogger to startSync in postApplicationStart.

                      ","text":""},{"location":"SparkContext/#appstatussource","title":"AppStatusSource

                      SparkContext can create an AppStatusSource when created (based on the spark.metrics.appStatusSource.enabled configuration property).

                      SparkContext uses the AppStatusSource to create the AppStatusStore.

                      If configured, SparkContext registers the AppStatusSource with the MetricsSystem.

                      ","text":""},{"location":"SparkContext/#appstatusstore","title":"AppStatusStore

                      SparkContext creates an AppStatusStore when created (with itself and the AppStatusSource).

                      SparkContext requests AppStatusStore for the AppStatusListener and requests the LiveListenerBus to add it to the application status queue.

                      SparkContext uses the AppStatusStore to create the following:

                      • SparkStatusTracker
                      • SparkUI

                      AppStatusStore is requested to close when SparkContext is requested to stop.

                      ","text":""},{"location":"SparkContext/#statusstore","title":"statusStore
                      statusStore: AppStatusStore\n

                      statusStore returns the AppStatusStore.

                      statusStore is used when:

                      • SparkContext is requested to getRDDStorageInfo
                      • ConsoleProgressBar is requested to refresh
                      • HiveThriftServer2 is requested to createListenerAndUI
                      • SharedState (Spark SQL) is requested for a SQLAppStatusStore and a StreamingQueryStatusListener
                      ","text":""},{"location":"SparkContext/#sparkstatustracker","title":"SparkStatusTracker

                      SparkContext creates a SparkStatusTracker when created (with itself and the AppStatusStore).

                      ","text":""},{"location":"SparkContext/#statustracker","title":"statusTracker
                      statusTracker: SparkStatusTracker\n

                      statusTracker returns the SparkStatusTracker.

                      ","text":""},{"location":"SparkContext/#local-properties","title":"Local Properties
                      localProperties: InheritableThreadLocal[Properties]\n

                      SparkContext uses an InheritableThreadLocal (Java) of key-value pairs of thread-local properties to pass extra information from a parent thread (on the driver) to child threads.

                      localProperties is meant to be used by developers using SparkContext.setLocalProperty and SparkContext.getLocalProperty.

                      Local Properties are available using TaskContext.getLocalProperty.
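
                      A short demo of passing a local property from the driver thread to tasks (the property name and value are made up):

                      sc.setLocalProperty(\"demo.prop\", \"42\")\nval values = sc.parallelize(1 to 2, numSlices = 2).map { _ =>\n  org.apache.spark.TaskContext.get.getLocalProperty(\"demo.prop\")\n}.collect()\nassert(values.forall(_ == \"42\"))\n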

                      Local Properties are available to SparkListeners using the following events:

                      • SparkListenerJobStart
                      • SparkListenerStageSubmitted

                      localProperties are passed down when SparkContext is requested for the following:

                      • Running Job (that in turn makes the local properties available to the DAGScheduler to run a job)
                      • Running Approximate Job
                      • Submitting Job
                      • Submitting MapStage

                      DAGScheduler passes down local properties when scheduling:

                      • ShuffleMapTasks
                      • ResultTasks
                      • TaskSets

                      Spark (Core) defines the following local properties.

                      • callSite.long, callSite.short -- set by SparkContext.setCallSite
                      • spark.job.description (default: callSite.short) -- set by SparkContext.setJobDescription (SparkContext.setJobGroup)
                      • spark.job.interruptOnCancel -- set by SparkContext.setJobGroup
                      • spark.jobGroup.id -- set by SparkContext.setJobGroup
                      • spark.scheduler.pool","text":""},{"location":"SparkContext/#shuffledrivercomponents","title":"ShuffleDriverComponents

                      SparkContext creates a ShuffleDriverComponents when created.

                      SparkContext loads the ShuffleDataIO that is in turn requested for the ShuffleDriverComponents. SparkContext requests the ShuffleDriverComponents to initialize.

                      The ShuffleDriverComponents is used when:

                      • ShuffleDependency is created
                      • SparkContext creates the ContextCleaner (if enabled)

                      SparkContext requests the ShuffleDriverComponents to clean up when stopping.

                      ","text":""},{"location":"SparkContext/#static-files","title":"Static Files","text":""},{"location":"SparkContext/#addfile","title":"addFile
                      addFile(\n  path: String,\n  recursive: Boolean): Unit\n// recursive = false\naddFile(\n  path: String): Unit\n

                      First, addFile validates the URI scheme of the given path. For a path with no scheme, addFile converts it to a canonical form. For a path with the local scheme, addFile prints out the following WARN message to the logs and exits.

                      File with 'local' scheme is not supported to add to file server, since it is already available on every node.\n
                      For a path with any other scheme, addFile creates a Hadoop Path from the given path.

                      addFile validates the URL if the path is an HTTP, HTTPS or FTP URI.

                      addFile throws a SparkException with the following message if the path is a local directory and the master is not local:

                      addFile does not support local directories when not running local mode.\n

                      addFile throws a SparkException with the following message if the path is a directory and the recursive flag is not enabled:

                      Added file $hadoopPath is a directory and recursive is not turned on.\n

                      In the end, addFile adds the file to the addedFiles internal registry (with the current timestamp):

                      • For new files, addFile prints out the following INFO message to the logs, fetches the file (to the root directory and without using the cache) and postEnvironmentUpdate.

                        Added file [path] at [key] with timestamp [timestamp]\n
                      • For files that were already added, addFile prints out the following WARN message to the logs:

                        The path [path] has been added already. Overwriting of added paths is not supported in the current version.\n

                      addFile is used when:

                      • SparkContext is created
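
                      A short demo (the file path is hypothetical):

                      sc.addFile(\"/tmp/lookup.txt\")\nassert(sc.listFiles().exists(_.endsWith(\"lookup.txt\")))\n\n// on executors (and the driver), resolve the locally-fetched copy\nimport org.apache.spark.SparkFiles\nval localPath = SparkFiles.get(\"lookup.txt\")\n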
                      ","text":""},{"location":"SparkContext/#listfiles","title":"listFiles
                      listFiles(): Seq[String]\n

                      listFiles returns the files added (the keys of the addedFiles internal registry).

                      ","text":""},{"location":"SparkContext/#addedfiles-internal-registry","title":"addedFiles Internal Registry
                      addedFiles: Map[String, Long]\n

                      addedFiles is a collection of static files by the timestamp they were added at.

                      addedFiles is used when:

                      • SparkContext is requested to postEnvironmentUpdate and listFiles
                      • TaskSetManager is created (and resourceOffer)
                      ","text":""},{"location":"SparkContext/#files","title":"files
                      files: Seq[String]\n

                      files is a collection of file paths defined by spark.files configuration property.

                      ","text":""},{"location":"SparkContext/#posting-sparklistenerenvironmentupdate-event","title":"Posting SparkListenerEnvironmentUpdate Event
                      postEnvironmentUpdate(): Unit\n

                      postEnvironmentUpdate...FIXME

                      postEnvironmentUpdate is used when:

                      • SparkContext is requested to addFile and addJar
                      ","text":""},{"location":"SparkContext/#getorcreate-utility","title":"getOrCreate Utility
                      getOrCreate(): SparkContext\ngetOrCreate(\n  config: SparkConf): SparkContext\n

                      getOrCreate...FIXME
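
                      In its simplest form, getOrCreate returns the active SparkContext if one exists or creates a new one with the given (or default) SparkConf:

                      import org.apache.spark.{SparkConf, SparkContext}\nval sc = SparkContext.getOrCreate(\n  new SparkConf().setAppName(\"demo\").setMaster(\"local[*]\"))\n// a second call returns the same instance\nassert(SparkContext.getOrCreate() eq sc)\n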

                      ","text":""},{"location":"SparkContext/#plugincontainer","title":"PluginContainer

                      SparkContext creates a PluginContainer when created.

                      PluginContainer is created (for the driver where SparkContext lives) using PluginContainer.apply utility.

                      PluginContainer is then requested to registerMetrics with the applicationId.

                      PluginContainer is requested to shutdown when SparkContext is requested to stop.

                      ","text":""},{"location":"SparkContext/#creating-schedulerbackend-and-taskscheduler","title":"Creating SchedulerBackend and TaskScheduler
                      createTaskScheduler(\n  sc: SparkContext,\n  master: String,\n  deployMode: String): (SchedulerBackend, TaskScheduler)\n

                      createTaskScheduler creates a SchedulerBackend and a TaskScheduler for the given master URL and deployment mode.

                      Internally, createTaskScheduler branches off per the given master URL to select the requested implementations.

                      createTaskScheduler accepts the following master URLs:

                      • local - local mode with 1 thread only
                      • local[n] or local[*] - local mode with n threads
                      • local[n, m] or local[*, m] -- local mode with n threads and m task failures
                      • spark://hostname:port for Spark Standalone
                      • local-cluster[n, m, z] -- local cluster with n workers, m cores per worker, and z memory per worker
                      • Other URLs are simply handed over to getClusterManager to load an external cluster manager if available

                      createTaskScheduler is used when SparkContext is created.
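
                      For example, the following master URL is parsed by createTaskScheduler as local mode with 2 threads and up to 3 task failures (a standalone sketch):

                      import org.apache.spark.{SparkConf, SparkContext}\nval sc = new SparkContext(\n  new SparkConf()\n    .setAppName(\"master-url-demo\")\n    .setMaster(\"local[2, 3]\"))\n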

                      ","text":""},{"location":"SparkContext/#loading-externalclustermanager","title":"Loading ExternalClusterManager
                      getClusterManager(\n  url: String): Option[ExternalClusterManager]\n

                      getClusterManager uses Java's ServiceLoader to find and load an ExternalClusterManager that supports the given master URL.

                      ExternalClusterManager Service Discovery

                      For ServiceLoader to find ExternalClusterManagers, they have to be registered using the following file:

                      META-INF/services/org.apache.spark.scheduler.ExternalClusterManager\n

                      getClusterManager throws a SparkException when multiple cluster managers were found:

                      Multiple external cluster managers registered for the url [url]: [serviceLoaders]\n

                      getClusterManager\u00a0is used when SparkContext is requested for a SchedulerBackend and TaskScheduler.
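
                      A minimal sketch of a custom ExternalClusterManager (the class name and master URL scheme are made up; since the contract is private[spark], the implementation has to live in an org.apache.spark package and be listed in the META-INF/services file above):

                      package org.apache.spark.scheduler\n\nimport org.apache.spark.SparkContext\n\nclass DemoClusterManager extends ExternalClusterManager {\n  // claim master URLs with a custom scheme\n  override def canCreate(masterURL: String): Boolean = masterURL.startsWith(\"demo://\")\n  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =\n    new TaskSchedulerImpl(sc)\n  override def createSchedulerBackend(\n      sc: SparkContext,\n      masterURL: String,\n      scheduler: TaskScheduler): SchedulerBackend = ???  // a real backend goes here\n  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =\n    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)\n}\n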

                      ","text":""},{"location":"SparkContext/#runJob","title":"Running Job (Synchronously)
                      runJob[T, U: ClassTag](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U): Array[U]\nrunJob[T, U: ClassTag](\n  rdd: RDD[T],\n  processPartition: (TaskContext, Iterator[T]) => U,\n  resultHandler: (Int, U) => Unit): Unit\nrunJob[T, U: ClassTag](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int]): Array[U]\nrunJob[T, U: ClassTag]( // (1)!\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int],\n  resultHandler: (Int, U) => Unit): Unit\nrunJob[T, U: ClassTag](\n  rdd: RDD[T],\n  func: Iterator[T] => U): Array[U]\nrunJob[T, U: ClassTag](\n  rdd: RDD[T],\n  processPartition: Iterator[T] => U,\n  resultHandler: (Int, U) => Unit): Unit\nrunJob[T, U: ClassTag](\n  rdd: RDD[T],\n  func: Iterator[T] => U,\n  partitions: Seq[Int]): Array[U]\n
                      1. Requests the DAGScheduler to run a job

                      runJob determines the call site and cleans up the given func function.

                      runJob prints out the following INFO message to the logs:

                      Starting job: [callSite]\n

                      With spark.logLineage enabled, runJob requests the given RDD for the recursive dependencies and prints out the following INFO message to the logs:

                      RDD's recursive dependencies:\n[toDebugString]\n

                      runJob requests the DAGScheduler to run a job with the following:

                      • The given rdd
                      • The given func cleaned up
                      • The given partitions
                      • The call site
                      • The given resultHandler function (procedure)
                      • The local properties

                      Note

                      runJob is blocked until the job has finished (regardless of the result, successful or not).

                      runJob requests the ConsoleProgressBar (if available) to finishAll.

                      In the end, runJob requests the given RDD to doCheckpoint.

                      ","text":""},{"location":"SparkContext/#runJob-demo","title":"Demo

                      runJob is essentially executing a func function on all or a subset of partitions of an RDD and returning the result as an array (with elements being the results per partition).

                      sc.setLocalProperty(\"callSite.short\", \"runJob Demo\")\n\nval partitionsNumber = 4\nval rdd = sc.parallelize(\n  Seq(\"hello world\", \"nice to see you\"),\n  numSlices = partitionsNumber)\n\nimport org.apache.spark.TaskContext\nval func = (t: TaskContext, ss: Iterator[String]) => 1\nval result = sc.runJob(rdd, func)\nassert(result.length == partitionsNumber)\n\nsc.clearCallSite()\n
                      ","text":""},{"location":"SparkContext/#call-site","title":"Call Site
                      getCallSite(): CallSite\n

                      getCallSite...FIXME

                      getCallSite\u00a0is used when:

                      • SparkContext is requested to broadcast, runJob, runApproximateJob, submitJob and submitMapStage
                      • AsyncRDDActions is requested to takeAsync
                      • RDD is created
                      ","text":""},{"location":"SparkContext/#closure-cleaning","title":"Closure Cleaning
                      clean(\n  f: F,\n  checkSerializable: Boolean = true): F\n

                      clean cleans up the given f closure (using ClosureCleaner.clean utility).

                      Tip

                      Enable DEBUG logging level for org.apache.spark.util.ClosureCleaner logger to see what happens inside the class.

                      Add the following line to conf/log4j.properties:

                      log4j.logger.org.apache.spark.util.ClosureCleaner=DEBUG\n

                      Refer to Logging.

                      With DEBUG logging level you should see the following messages in the logs:

                      +++ Cleaning closure [func] ([func.getClass.getName]) +++\n + declared fields: [declaredFields.size]\n     [field]\n ...\n+++ closure [func] ([func.getClass.getName]) is now cleaned +++\n
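
                      The practical effect for user code: closures passed to RDD operators go through clean, and (with checkSerializable enabled) non-serializable captures fail fast with a \"Task not serializable\" SparkException. A small sketch (the class is made up):

                      class Holder { val factor = 2 }  // not Serializable\nval holder = new Holder\n\n// sc.parallelize(1 to 3).map(_ * holder.factor)  // would fail in clean: Task not serializable\n\nval factor = holder.factor  // capture only the primitive value instead\nval doubled = sc.parallelize(1 to 3).map(_ * factor).collect()\nassert(doubled.sameElements(Array(2, 4, 6)))\n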
                      ","text":""},{"location":"SparkContext/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks
                      maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                      maxNumConcurrentTasks requests the SchedulerBackend for the maximum number of tasks that can be launched concurrently (with the given ResourceProfile).

                      maxNumConcurrentTasks is used when:

                      • DAGScheduler is requested to checkBarrierStageWithNumSlots
                      ","text":""},{"location":"SparkContext/#withScope","title":"withScope
                      withScope[U](\n  body: => U): U\n

                      withScope executes the given body in an RDD operation scope (using the RDDOperationScope.withScope utility with this SparkContext).

                      Note

                      withScope is used for most (if not all) SparkContext API operators.

                      ","text":""},{"location":"SparkContext/#logging","title":"Logging

                      Enable ALL logging level for org.apache.spark.SparkContext logger to see what happens inside.

                      Add the following line to conf/log4j2.properties:

                      logger.SparkContext.name = org.apache.spark.SparkContext\nlogger.SparkContext.level = all\n

                      Refer to Logging.

                      ","text":""},{"location":"SparkCoreErrors/","title":"SparkCoreErrors","text":""},{"location":"SparkCoreErrors/#numPartitionsGreaterThanMaxNumConcurrentTasksError","title":"numPartitionsGreaterThanMaxNumConcurrentTasksError","text":"
                      numPartitionsGreaterThanMaxNumConcurrentTasksError(\n  numPartitions: Int,\n  maxNumConcurrentTasks: Int): Throwable\n

                      numPartitionsGreaterThanMaxNumConcurrentTasksError creates a BarrierJobSlotsNumberCheckFailed with the given input arguments.

                      numPartitionsGreaterThanMaxNumConcurrentTasksError is used when:

                      • DAGScheduler is requested to checkBarrierStageWithNumSlots
                      "},{"location":"SparkEnv/","title":"SparkEnv","text":"

                      SparkEnv is a handle to Spark Execution Environment with the core services of Apache Spark (that interact with each other to establish a distributed computing platform for a Spark application).

                      There are separate SparkEnvs for the driver and executors.

                      ","tags":["DeveloperApi"]},{"location":"SparkEnv/#core-services","title":"Core Services Property Service blockManager BlockManager broadcastManager BroadcastManager closureSerializer Serializer conf SparkConf mapOutputTracker MapOutputTracker memoryManager MemoryManager metricsSystem MetricsSystem outputCommitCoordinator OutputCommitCoordinator rpcEnv RpcEnv securityManager SecurityManager serializer Serializer serializerManager SerializerManager shuffleManager ShuffleManager","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-instance","title":"Creating Instance

                      SparkEnv takes the following to be created:

                      • Executor ID
                      • RpcEnv
                      • Serializer
                      • Serializer
                      • SerializerManager
                      • MapOutputTracker
                      • ShuffleManager
                      • BroadcastManager
                      • BlockManager
                      • SecurityManager
                      • MetricsSystem
                      • MemoryManager
                      • OutputCommitCoordinator
                      • SparkConf
                      • SparkEnv is created using create utility.

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#drivers-temporary-directory","title":"Driver's Temporary Directory
                        driverTmpDir: Option[String]\n

                        SparkEnv defines driverTmpDir internal registry for the driver to be used as the root directory of files added using SparkContext.addFile.

                        driverTmpDir is undefined initially and is defined for the driver only when SparkEnv utility is used to create a \"base\" SparkEnv.

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#demo","title":"Demo
                        import org.apache.spark.SparkEnv\n
                        // :pa -raw\n// BEGIN\npackage org.apache.spark\nobject BypassPrivateSpark {\n  def driverTmpDir(sparkEnv: SparkEnv) = {\n    sparkEnv.driverTmpDir\n  }\n}\n// END\n
                        val driverTmpDir = org.apache.spark.BypassPrivateSpark.driverTmpDir(SparkEnv.get).get\n

                        The above is equivalent to the following snippet.

                        import org.apache.spark.SparkFiles\nSparkFiles.getRootDirectory\n
                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-sparkenv-for-driver","title":"Creating SparkEnv for Driver
                        createDriverEnv(\n  conf: SparkConf,\n  isLocal: Boolean,\n  listenerBus: LiveListenerBus,\n  numCores: Int,\n  mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv\n

                        createDriverEnv creates a SparkEnv execution environment for the driver.

                        createDriverEnv accepts an instance of SparkConf, whether it runs in local mode or not, a LiveListenerBus, the number of cores to use for execution in local mode or 0 otherwise, and an optional OutputCommitCoordinator (default: none).

                        createDriverEnv ensures that the spark.driver.host and spark.driver.port settings are defined.

                        It then passes the call straight on to the create utility to create the \"base\" SparkEnv (with the driver executor id, isDriver enabled, and the input parameters).

                        createDriverEnv is used when SparkContext is created.

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-sparkenv-for-executor","title":"Creating SparkEnv for Executor
                        createExecutorEnv(\n  conf: SparkConf,\n  executorId: String,\n  hostname: String,\n  numCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  isLocal: Boolean): SparkEnv\ncreateExecutorEnv(\n  conf: SparkConf,\n  executorId: String,\n  bindAddress: String,\n  hostname: String,\n  numCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  isLocal: Boolean): SparkEnv\n

                        createExecutorEnv creates an executor's (execution) environment that is the Spark execution environment for an executor.

                        createExecutorEnv simply creates the \"base\" SparkEnv (passing in all the input parameters) and sets it as the default SparkEnv (using SparkEnv.set).

                        NOTE: The number of cores numCores is configured using --cores command-line option of CoarseGrainedExecutorBackend and is specific to a cluster manager.

                        createExecutorEnv is used when CoarseGrainedExecutorBackend utility is requested to run.

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-base-sparkenv","title":"Creating \"Base\" SparkEnv
                        create(\n  conf: SparkConf,\n  executorId: String,\n  bindAddress: String,\n  advertiseAddress: String,\n  port: Option[Int],\n  isLocal: Boolean,\n  numUsableCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  listenerBus: LiveListenerBus = null,\n  mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv\n

                        create creates the \"base\" SparkEnv (that is common across the driver and executors).

                        create creates a RpcEnv as sparkDriver on the driver and sparkExecutor on executors.

                        create creates a Serializer (based on spark.serializer configuration property). create prints out the following DEBUG message to the logs:

                        Using serializer: [serializer]\n

                        create creates a SerializerManager.

                        create creates a JavaSerializer as the closure serializer.

                        create creates a BroadcastManager.

                        create creates a MapOutputTrackerMaster (on the driver) or a MapOutputTrackerWorker (on executors). create registers or looks up a MapOutputTrackerMasterEndpoint under the name of MapOutputTracker. create prints out the following INFO message to the logs (on the driver only):

                        Registering MapOutputTracker\n

create creates a ShuffleManager (based on the spark.shuffle.manager configuration property).

                        create creates a UnifiedMemoryManager.

                        With spark.shuffle.service.enabled configuration property enabled, create creates an ExternalBlockStoreClient.

                        create creates a BlockManagerMaster.

                        create creates a NettyBlockTransferService.

                        create creates a BlockManager.

                        create creates a MetricsSystem.

create creates an OutputCommitCoordinator and registers or looks up an OutputCommitCoordinatorEndpoint under the name of OutputCommitCoordinator.

                        create creates a SparkEnv (with all the services \"stitched\" together).

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#logging","title":"Logging

                        Enable ALL logging level for org.apache.spark.SparkEnv logger to see what happens inside.

                        Add the following line to conf/log4j.properties:

                        log4j.logger.org.apache.spark.SparkEnv=ALL\n

                        Refer to Logging.

                        ","text":"","tags":["DeveloperApi"]},{"location":"SparkFiles/","title":"SparkFiles","text":"

SparkFiles is a utility to work with files added using SparkContext.addFile.

                        "},{"location":"SparkFiles/#absolute-path-of-added-file","title":"Absolute Path of Added File
                        get(\n  filename: String): String\n

                        get gets the absolute path of the given file in the root directory.
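
A minimal sketch (assuming a README.md file in the current directory) of adding a file on the driver and resolving its local path on executors:

import org.apache.spark.SparkFiles\n\nsc.addFile(\"README.md\")   // driver side\n\nsc.parallelize(1 to 2).foreach { _ =>\n  // executor side: absolute path under the root directory\n  println(SparkFiles.get(\"README.md\"))\n}\n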

                        ","text":""},{"location":"SparkFiles/#root-directory","title":"Root Directory
                        getRootDirectory(): String\n

                        getRootDirectory requests the current SparkEnv for driverTmpDir (if defined) or defaults to the current directory (.).

                        getRootDirectory\u00a0is used when:

                        • SparkContext is requested to addFile
                        • Executor is requested to updateDependencies
                        • SparkFiles utility is requested to get the absolute path of a file
                        ","text":""},{"location":"SparkHadoopWriter/","title":"SparkHadoopWriter Utility","text":""},{"location":"SparkHadoopWriter/#writing-key-value-rdd-out-as-hadoop-outputformat","title":"Writing Key-Value RDD Out (As Hadoop OutputFormat)
                        write[K, V: ClassTag](\n  rdd: RDD[(K, V)],\n  config: HadoopWriteConfigUtil[K, V]): Unit\n

                        write runs a Spark job to write out partition records (for all partitions of the given key-value RDD) with the given HadoopWriteConfigUtil and a HadoopMapReduceCommitProtocol committer.

                        The number of writer tasks (parallelism) is the number of the partitions in the given key-value RDD.

                        ","text":""},{"location":"SparkHadoopWriter/#internals","title":"Internals

                        Internally, write uses the id of the given RDD as the commitJobId.

                        write creates a jobTrackerId with the current date.

                        write requests the given HadoopWriteConfigUtil to create a Hadoop JobContext (for the jobTrackerId and commitJobId).

                        write requests the given HadoopWriteConfigUtil to initOutputFormat with the Hadoop JobContext.

                        write requests the given HadoopWriteConfigUtil to assertConf.

                        write requests the given HadoopWriteConfigUtil to create a HadoopMapReduceCommitProtocol committer for the commitJobId.

                        write requests the HadoopMapReduceCommitProtocol to setupJob (with the jobContext).

                        write uses the SparkContext (of the given RDD) to run a Spark job asynchronously for the given RDD with the executeTask partition function.

                        In the end, write requests the HadoopMapReduceCommitProtocol to commit the job and prints out the following INFO message to the logs:

                        Job [getJobID] committed.\n
                        ","text":""},{"location":"SparkHadoopWriter/#throwables","title":"Throwables

                        In case of any Throwable, write prints out the following ERROR message to the logs:

                        Aborting job [getJobID].\n

                        write requests the HadoopMapReduceCommitProtocol to abort the job and throws a SparkException:

                        Job aborted.\n
                        ","text":""},{"location":"SparkHadoopWriter/#usage","title":"Usage

                        write\u00a0is used when:

                        • PairRDDFunctions.saveAsNewAPIHadoopDataset
                        • PairRDDFunctions.saveAsHadoopDataset
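
As an illustration (a sketch only, with a hypothetical output path), saving a pair RDD with the new Hadoop API goes through PairRDDFunctions.saveAsNewAPIHadoopDataset and hence SparkHadoopWriter.write:

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat\n\nval pairs = sc.parallelize(Seq((\"a\", 1), (\"b\", 2)))\n// delegates to saveAsNewAPIHadoopDataset and, internally, SparkHadoopWriter.write\npairs.saveAsNewAPIHadoopFile[TextOutputFormat[String, Int]](\"/tmp/pairs-out\")\n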
                        ","text":""},{"location":"SparkHadoopWriter/#writing-rdd-partition","title":"Writing RDD Partition
                        executeTask[K, V: ClassTag](\n  context: TaskContext,\n  config: HadoopWriteConfigUtil[K, V],\n  jobTrackerId: String,\n  commitJobId: Int,\n  sparkPartitionId: Int,\n  sparkAttemptNumber: Int,\n  committer: FileCommitProtocol,\n  iterator: Iterator[(K, V)]): TaskCommitMessage\n

                        Fixme

                        Review Me

                        executeTask requests the given HadoopWriteConfigUtil to create a TaskAttemptContext.

                        executeTask requests the given FileCommitProtocol to set up a task with the TaskAttemptContext.

                        executeTask requests the given HadoopWriteConfigUtil to initWriter (with the TaskAttemptContext and the given sparkPartitionId).

executeTask initializes Hadoop output metrics (initHadoopOutputMetrics).

                        executeTask writes all rows of the RDD partition (from the given Iterator[(K, V)]). executeTask requests the given HadoopWriteConfigUtil to write. In the end, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to commit the task.

                        executeTask updates metrics about writing data to external systems (bytesWritten and recordsWritten) every few records and at the end.

                        In case of any errors, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to abort the task. In the end, executeTask prints out the following ERROR message to the logs:

                        Task [taskAttemptID] aborted.\n

                        executeTask is used when:

                        • SparkHadoopWriter utility is used to write
                        ","text":""},{"location":"SparkHadoopWriter/#logging","title":"Logging

                        Enable ALL logging level for org.apache.spark.internal.io.SparkHadoopWriter logger to see what happens inside.

                        Add the following line to conf/log4j.properties:

                        log4j.logger.org.apache.spark.internal.io.SparkHadoopWriter=ALL\n

                        Refer to Logging.

                        ","text":""},{"location":"SparkListener/","title":"SparkListener","text":"

                        SparkListener\u00a0is an extension of the SparkListenerInterface abstraction for event listeners with a no-op implementation for callback methods.

                        ","tags":["DeveloperApi"]},{"location":"SparkListener/#implementations","title":"Implementations","text":"
                        • BarrierCoordinator
                        • SparkSession (Spark SQL)
                        • AppListingListener (Spark History Server)
                        • AppStatusListener
                        • BasicEventFilterBuilder (Spark History Server)
                        • EventLoggingListener (Spark History Server)
                        • ExecutionListenerBus
                        • ExecutorAllocationListener
                        • ExecutorMonitor
                        • HeartbeatReceiver
                        • HiveThriftServer2Listener (Spark Thrift Server)
                        • SpillListener
                        • SQLAppStatusListener (Spark SQL)
                        • SQLEventFilterBuilder
                        • StatsReportListener
                        • StreamingQueryListenerBus (Spark Structured Streaming)
                        ","tags":["DeveloperApi"]},{"location":"SparkListenerBus/","title":"SparkListenerBus","text":"

                        SparkListenerBus\u00a0is an extension of the ListenerBus abstraction for event buses for SparkListenerInterfaces to be notified about SparkListenerEvents.

                        "},{"location":"SparkListenerBus/#posting-event-to-sparklistener","title":"Posting Event to SparkListener
                        doPostEvent(\n  listener: SparkListenerInterface,\n  event: SparkListenerEvent): Unit\n

                        doPostEvent\u00a0is part of the ListenerBus abstraction.

                        doPostEvent notifies the given SparkListenerInterface about the SparkListenerEvent.

                        doPostEvent calls an event-specific method of SparkListenerInterface or falls back to onOtherEvent.

                        ","text":""},{"location":"SparkListenerBus/#implementations","title":"Implementations","text":"
                        • AsyncEventQueue
                        • ReplayListenerBus
                        "},{"location":"SparkListenerEvent/","title":"SparkListenerEvent","text":"

                        SparkListenerEvent is an abstraction of scheduling events.

                        "},{"location":"SparkListenerEvent/#dispatching-sparklistenerevents","title":"Dispatching SparkListenerEvents","text":"

SparkListenerBus in general (and AsyncEventQueue in particular) are event buses used to dispatch SparkListenerEvents to registered SparkListeners.

                        LiveListenerBus is an event bus to dispatch SparkListenerEvents to registered SparkListeners.

                        "},{"location":"SparkListenerEvent/#spark-history-server","title":"Spark History Server","text":"

Once events have been logged, Spark History Server uses the JsonProtocol utility to reconstruct them from JSON (sparkEventFromJson).

                        "},{"location":"SparkListenerEvent/#contract","title":"Contract","text":""},{"location":"SparkListenerEvent/#logevent","title":"logEvent
                        logEvent: Boolean\n

                        logEvent controls whether EventLoggingListener should save the event to an event log.

                        Default: true

                        logEvent\u00a0is used when:

                        • EventLoggingListener is requested to handle \"other\" events
                        ","text":""},{"location":"SparkListenerEvent/#implementations","title":"Implementations","text":""},{"location":"SparkListenerEvent/#sparklistenerapplicationend","title":"SparkListenerApplicationEnd","text":""},{"location":"SparkListenerEvent/#sparklistenerapplicationstart","title":"SparkListenerApplicationStart","text":""},{"location":"SparkListenerEvent/#sparklistenerblockmanageradded","title":"SparkListenerBlockManagerAdded","text":""},{"location":"SparkListenerEvent/#sparklistenerblockmanagerremoved","title":"SparkListenerBlockManagerRemoved","text":""},{"location":"SparkListenerEvent/#sparklistenerblockupdated","title":"SparkListenerBlockUpdated","text":""},{"location":"SparkListenerEvent/#sparklistenerenvironmentupdate","title":"SparkListenerEnvironmentUpdate","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutoradded","title":"SparkListenerExecutorAdded","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorblacklisted","title":"SparkListenerExecutorBlacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorblacklistedforstage","title":"SparkListenerExecutorBlacklistedForStage","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutormetricsupdate","title":"SparkListenerExecutorMetricsUpdate","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorremoved","title":"SparkListenerExecutorRemoved","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorunblacklisted","title":"SparkListenerExecutorUnblacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerjobend","title":"SparkListenerJobEnd","text":""},{"location":"SparkListenerEvent/#sparklistenerjobstart","title":"SparkListenerJobStart","text":""},{"location":"SparkListenerEvent/#sparklistenerlogstart","title":"SparkListenerLogStart","text":""},{"location":"SparkListenerEvent/#sparklistenernodeblacklisted","title":"SparkListenerNodeBlacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenernodeblacklistedforstage","title":"SparkListenerNodeBlacklistedForStage","text":""},{"location":"SparkListenerEvent/#sparklistenernodeunblacklisted","title":"SparkListenerNodeUnblacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerspeculativetasksubmitted","title":"SparkListenerSpeculativeTaskSubmitted","text":""},{"location":"SparkListenerEvent/#sparklistenerstagecompleted","title":"SparkListenerStageCompleted","text":""},{"location":"SparkListenerEvent/#sparklistenerstageexecutormetrics","title":"SparkListenerStageExecutorMetrics","text":""},{"location":"SparkListenerEvent/#sparklistenerstagesubmitted","title":"SparkListenerStageSubmitted","text":""},{"location":"SparkListenerEvent/#sparklistenertaskend","title":"SparkListenerTaskEnd

                        SparkListenerTaskEnd

                        ","text":""},{"location":"SparkListenerEvent/#sparklistenertaskgettingresult","title":"SparkListenerTaskGettingResult","text":""},{"location":"SparkListenerEvent/#sparklistenertaskstart","title":"SparkListenerTaskStart","text":""},{"location":"SparkListenerEvent/#sparklistenerunpersistrdd","title":"SparkListenerUnpersistRDD","text":""},{"location":"SparkListenerInterface/","title":"SparkListenerInterface","text":"

                        SparkListenerInterface is an abstraction of event listeners (that SparkListenerBus notifies about scheduling events).

                        SparkListenerInterface is a way to intercept scheduling events from the Spark Scheduler that are emitted over the course of execution of a Spark application.

                        SparkListenerInterface is used heavily to manage communication between internal components in the distributed environment for a Spark application (e.g. web UI, event persistence for History Server, dynamic allocation of executors, keeping track of executors).

                        SparkListenerInterface can be registered in a Spark application using SparkContext.addSparkListener method or spark.extraListeners configuration property.

                        Tip

                        Enable INFO logging level for org.apache.spark.SparkContext logger to see what and when custom Spark listeners are registered.
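
As a sketch (the listener and package names are hypothetical), a custom listener can extend SparkListener (for no-op defaults) and be registered programmatically or via spark.extraListeners:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}\n\nclass TaskEndLogger extends SparkListener {\n  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =\n    println(s\"Task ended in stage ${taskEnd.stageId}\")\n}\n\nsc.addSparkListener(new TaskEndLogger)\n// or: --conf spark.extraListeners=com.example.TaskEndLogger\n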

                        "},{"location":"SparkListenerInterface/#onapplicationend","title":"onApplicationEnd
                        onApplicationEnd(\n  applicationEnd: SparkListenerApplicationEnd): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerApplicationEnd event
                        ","text":""},{"location":"SparkListenerInterface/#onapplicationstart","title":"onApplicationStart
                        onApplicationStart(\n  applicationStart: SparkListenerApplicationStart): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerApplicationStart event
                        ","text":""},{"location":"SparkListenerInterface/#onblockmanageradded","title":"onBlockManagerAdded
                        onBlockManagerAdded(\n  blockManagerAdded: SparkListenerBlockManagerAdded): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerBlockManagerAdded event
                        ","text":""},{"location":"SparkListenerInterface/#onblockmanagerremoved","title":"onBlockManagerRemoved
                        onBlockManagerRemoved(\n  blockManagerRemoved: SparkListenerBlockManagerRemoved): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerBlockManagerRemoved event
                        ","text":""},{"location":"SparkListenerInterface/#onblockupdated","title":"onBlockUpdated
                        onBlockUpdated(\n  blockUpdated: SparkListenerBlockUpdated): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerBlockUpdated event
                        ","text":""},{"location":"SparkListenerInterface/#onenvironmentupdate","title":"onEnvironmentUpdate
                        onEnvironmentUpdate(\n  environmentUpdate: SparkListenerEnvironmentUpdate): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerEnvironmentUpdate event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutoradded","title":"onExecutorAdded
                        onExecutorAdded(\n  executorAdded: SparkListenerExecutorAdded): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorAdded event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutorblacklisted","title":"onExecutorBlacklisted
                        onExecutorBlacklisted(\n  executorBlacklisted: SparkListenerExecutorBlacklisted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorBlacklisted event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutorblacklistedforstage","title":"onExecutorBlacklistedForStage
                        onExecutorBlacklistedForStage(\n  executorBlacklistedForStage: SparkListenerExecutorBlacklistedForStage): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorBlacklistedForStage event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutormetricsupdate","title":"onExecutorMetricsUpdate
                        onExecutorMetricsUpdate(\n  executorMetricsUpdate: SparkListenerExecutorMetricsUpdate): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorMetricsUpdate event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutorremoved","title":"onExecutorRemoved
                        onExecutorRemoved(\n  executorRemoved: SparkListenerExecutorRemoved): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorRemoved event
                        ","text":""},{"location":"SparkListenerInterface/#onexecutorunblacklisted","title":"onExecutorUnblacklisted
                        onExecutorUnblacklisted(\n  executorUnblacklisted: SparkListenerExecutorUnblacklisted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerExecutorUnblacklisted event
                        ","text":""},{"location":"SparkListenerInterface/#onjobend","title":"onJobEnd
                        onJobEnd(\n  jobEnd: SparkListenerJobEnd): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerJobEnd event
                        ","text":""},{"location":"SparkListenerInterface/#onjobstart","title":"onJobStart
                        onJobStart(\n  jobStart: SparkListenerJobStart): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerJobStart event
                        ","text":""},{"location":"SparkListenerInterface/#onnodeblacklisted","title":"onNodeBlacklisted
                        onNodeBlacklisted(\n  nodeBlacklisted: SparkListenerNodeBlacklisted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerNodeBlacklisted event
                        ","text":""},{"location":"SparkListenerInterface/#onnodeblacklistedforstage","title":"onNodeBlacklistedForStage
                        onNodeBlacklistedForStage(\n  nodeBlacklistedForStage: SparkListenerNodeBlacklistedForStage): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerNodeBlacklistedForStage event
                        ","text":""},{"location":"SparkListenerInterface/#onnodeunblacklisted","title":"onNodeUnblacklisted
                        onNodeUnblacklisted(\n  nodeUnblacklisted: SparkListenerNodeUnblacklisted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerNodeUnblacklisted event
                        ","text":""},{"location":"SparkListenerInterface/#onotherevent","title":"onOtherEvent
                        onOtherEvent(\n  event: SparkListenerEvent): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a custom SparkListenerEvent
                        ","text":""},{"location":"SparkListenerInterface/#onspeculativetasksubmitted","title":"onSpeculativeTaskSubmitted
                        onSpeculativeTaskSubmitted(\n  speculativeTask: SparkListenerSpeculativeTaskSubmitted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerSpeculativeTaskSubmitted event
                        ","text":""},{"location":"SparkListenerInterface/#onstagecompleted","title":"onStageCompleted
                        onStageCompleted(\n  stageCompleted: SparkListenerStageCompleted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerStageCompleted event
                        ","text":""},{"location":"SparkListenerInterface/#onstageexecutormetrics","title":"onStageExecutorMetrics
                        onStageExecutorMetrics(\n  executorMetrics: SparkListenerStageExecutorMetrics): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerStageExecutorMetrics event
                        ","text":""},{"location":"SparkListenerInterface/#onstagesubmitted","title":"onStageSubmitted
                        onStageSubmitted(\n  stageSubmitted: SparkListenerStageSubmitted): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerStageSubmitted event
                        ","text":""},{"location":"SparkListenerInterface/#ontaskend","title":"onTaskEnd
                        onTaskEnd(\n  taskEnd: SparkListenerTaskEnd): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerTaskEnd event
                        ","text":""},{"location":"SparkListenerInterface/#ontaskgettingresult","title":"onTaskGettingResult
                        onTaskGettingResult(\n  taskGettingResult: SparkListenerTaskGettingResult): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerTaskGettingResult event
                        ","text":""},{"location":"SparkListenerInterface/#ontaskstart","title":"onTaskStart
                        onTaskStart(\n  taskStart: SparkListenerTaskStart): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerTaskStart event
                        ","text":""},{"location":"SparkListenerInterface/#onunpersistrdd","title":"onUnpersistRDD
                        onUnpersistRDD(\n  unpersistRDD: SparkListenerUnpersistRDD): Unit\n

                        Used when:

                        • SparkListenerBus is requested to post a SparkListenerUnpersistRDD event
                        ","text":""},{"location":"SparkListenerInterface/#implementations","title":"Implementations
                        • EventFilterBuilder
                        • SparkFirehoseListener
                        • SparkListener
                        ","text":""},{"location":"SparkListenerTaskEnd/","title":"SparkListenerTaskEnd","text":"

                        SparkListenerTaskEnd is a SparkListenerEvent.

                        SparkListenerTaskEnd is posted (and created) when:

                        • DAGScheduler is requested to postTaskEnd

SparkListenerTaskEnd is intercepted using SparkListenerInterface.onTaskEnd.

                        "},{"location":"SparkListenerTaskEnd/#creating-instance","title":"Creating Instance","text":"

                        SparkListenerTaskEnd takes the following to be created:

                        • Stage ID
                        • Stage Attempt ID
                        • Task Type
                        • TaskEndReason
                        • TaskInfo
                        • ExecutorMetrics
                        • TaskMetrics"},{"location":"SparkStatusTracker/","title":"SparkStatusTracker","text":"

SparkStatusTracker is created for a SparkContext to give Spark developers access to the AppStatusStore and the following:

                          • All active job IDs
                          • All active stage IDs
                          • All known job IDs (and possibly limited to a particular job group)
                          • SparkExecutorInfos of all known executors
                          • SparkJobInfo of a job ID
                          • SparkStageInfo of a stage ID
                          "},{"location":"SparkStatusTracker/#creating-instance","title":"Creating Instance","text":"

                          SparkStatusTracker takes the following to be created:

                          • SparkContext (unused)
                          • AppStatusStore

                            SparkStatusTracker is created\u00a0when:

                            • SparkContext is created
                            "},{"location":"SpillListener/","title":"SpillListener","text":"

                            SpillListener is a SparkListener that intercepts (listens to) the following events for detecting spills in jobs:

                            • onTaskEnd
                            • onStageCompleted

                            SpillListener is used for testing only.

                            "},{"location":"SpillListener/#creating-instance","title":"Creating Instance","text":"

                            SpillListener takes no input arguments to be created.

                            SpillListener is created when TestUtils is requested to assertSpilled and assertNotSpilled.

                            "},{"location":"SpillListener/#ontaskend-callback","title":"onTaskEnd Callback
                            onTaskEnd(\n  taskEnd: SparkListenerTaskEnd): Unit\n

                            onTaskEnd...FIXME

                            onTaskEnd is part of the SparkListener abstraction.

                            ","text":""},{"location":"SpillListener/#onstagecompleted-callback","title":"onStageCompleted Callback
                            onStageCompleted(\n  stageComplete: SparkListenerStageCompleted): Unit\n

                            onStageCompleted...FIXME

                            onStageCompleted is part of the SparkListener abstraction.

                            ","text":""},{"location":"StatsReportListener/","title":"StatsReportListener \u2014 Logging Summary Statistics","text":"

org.apache.spark.scheduler.StatsReportListener (see the listener's scaladoc at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.StatsReportListener) is a SparkListener that logs summary statistics when each stage completes.

                            StatsReportListener listens to SparkListenerTaskEnd and SparkListenerStageCompleted events and prints them out at INFO logging level.

                            ","tags":["DeveloperApi"]},{"location":"StatsReportListener/#tip","title":"[TIP]","text":"

                            Enable INFO logging level for org.apache.spark.scheduler.StatsReportListener logger to see Spark events.

                            Add the following line to conf/log4j.properties:

                            log4j.logger.org.apache.spark.scheduler.StatsReportListener=INFO\n
                            ","tags":["DeveloperApi"]},{"location":"StatsReportListener/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":"

Intercepting Stage Completed Events (onStageCompleted Callback)

                            CAUTION: FIXME

Example

                            $ ./bin/spark-shell -c spark.extraListeners=org.apache.spark.scheduler.StatsReportListener\n...\nINFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener\n...\n\nscala> spark.read.text(\"README.md\").count\n...\nINFO StatsReportListener: Finished stage: Stage(0, 0); Name: 'count at <console>:24'; Status: succeeded; numTasks: 1; Took: 212 msec\nINFO StatsReportListener: task runtime:(count: 1, mean: 198.000000, stdev: 0.000000, max: 198.000000, min: 198.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms\nINFO StatsReportListener: shuffle bytes written:(count: 1, mean: 59.000000, stdev: 0.000000, max: 59.000000, min: 59.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B\nINFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms\nINFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: task result size:(count: 1, mean: 1885.000000, stdev: 0.000000, max: 1885.000000, min: 1885.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B\nINFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 73.737374, stdev: 0.000000, max: 73.737374, min: 73.737374)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   74 %    74 %    74 %    74 %    74 %    74 %    74 %    74 %    74 %\nINFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:    0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %\nINFO StatsReportListener: other time pct: (count: 1, mean: 26.262626, stdev: 0.000000, max: 26.262626, min: 26.262626)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   26 %    26 %    26 %    26 %    26 %    26 %    26 %    26 %    26 %\nINFO StatsReportListener: Finished stage: Stage(1, 0); Name: 'count at <console>:24'; Status: succeeded; numTasks: 1; Took: 34 msec\nINFO StatsReportListener: task runtime:(count: 1, mean: 33.000000, stdev: 0.000000, max: 33.000000, min: 33.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms\nINFO StatsReportListener: shuffle bytes written:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, 
stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms\nINFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: task result size:(count: 1, mean: 1960.000000, stdev: 0.000000, max: 1960.000000, min: 1960.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B\nINFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 75.757576, stdev: 0.000000, max: 75.757576, min: 75.757576)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   76 %    76 %    76 %    76 %    76 %    76 %    76 %    76 %    76 %\nINFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:    0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %\nINFO StatsReportListener: other time pct: (count: 1, mean: 24.242424, stdev: 0.000000, max: 24.242424, min: 24.242424)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   24 %    24 %    24 %    24 %    24 %    24 %    24 %    24 %    24 %\nres0: Long = 99\n
                            ","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/","title":"TaskCompletionListener","text":"

                            TaskCompletionListener\u00a0is an extension of the EventListener (Java) abstraction for task listeners that can be notified on task completion.

                            ","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/#ontaskcompletion","title":"onTaskCompletion
                            onTaskCompletion(\n  context: TaskContext): Unit\n

                            Used when:

                            • TaskContextImpl is requested to addTaskCompletionListener (and a task has already completed) and markTaskCompleted
                            • ShuffleFetchCompletionListener is requested to onComplete
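
A minimal sketch of registering a TaskCompletionListener from task code (assuming an existing RDD named rdd):

import org.apache.spark.TaskContext\nimport org.apache.spark.util.TaskCompletionListener\n\nrdd.mapPartitions { iter =>\n  // register a completion callback from inside a running task\n  TaskContext.get.addTaskCompletionListener(new TaskCompletionListener {\n    override def onTaskCompletion(context: TaskContext): Unit =\n      println(s\"Task ${context.taskAttemptId} completed\")\n  })\n  iter\n}.count()\n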
                            ","text":"","tags":["DeveloperApi"]},{"location":"TaskFailureListener/","title":"TaskFailureListener","text":"

                            TaskFailureListener\u00a0is an extension of the EventListener (Java) abstraction for task listeners that can be notified on task failure.

                            ","tags":["DeveloperApi"]},{"location":"TaskFailureListener/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"TaskFailureListener/#ontaskfailure","title":"onTaskFailure
                            onTaskFailure(\n  context: TaskContext,\n  error: Throwable): Unit\n

                            Used when:

                            • TaskContextImpl is requested to addTaskFailureListener (and a task has already failed) and markTaskFailed
                            ","text":"","tags":["DeveloperApi"]},{"location":"Utils/","title":"Utils Utility","text":""},{"location":"Utils/#getdynamicallocationinitialexecutors","title":"getDynamicAllocationInitialExecutors
                            getDynamicAllocationInitialExecutors(\n  conf: SparkConf): Int\n

                            getDynamicAllocationInitialExecutors gives the maximum value of the following configuration properties (for the initial number of executors):

                            • spark.dynamicAllocation.initialExecutors
                            • spark.dynamicAllocation.minExecutors
                            • spark.executor.instances

                            getDynamicAllocationInitialExecutors prints out the following INFO message to the logs:

                            Using initial executors = [initialExecutors],\nmax of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances\n
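
For example (a sketch with made-up values), the following configuration yields an initial executor count of 4, the maximum of the three properties:

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.dynamicAllocation.enabled\", \"true\")\n  .set(\"spark.dynamicAllocation.initialExecutors\", \"4\")\n  .set(\"spark.dynamicAllocation.minExecutors\", \"2\")\n  .set(\"spark.executor.instances\", \"3\")\n// initial executors = max(4, 2, 3) = 4\n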

                            With spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:

                            spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid,\nignoring its setting, please update your configs.\n

                            With spark.executor.instances less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:

                            spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid,\nignoring its setting, please update your configs.\n

                            getDynamicAllocationInitialExecutors is used when:

                            • ExecutorAllocationManager is created
                            • SchedulerBackendUtils utility is used to getInitialTargetExecutorNumber
                            ","text":""},{"location":"Utils/#local-directories-for-scratch-space","title":"Local Directories for Scratch Space
                            getConfiguredLocalDirs(\n  conf: SparkConf): Array[String]\n

                            getConfiguredLocalDirs returns the local directories where Spark can write files to.

                            getConfiguredLocalDirs uses the given SparkConf to know if External Shuffle Service is enabled or not (based on spark.shuffle.service.enabled configuration property).

                            When in a YARN container (CONTAINER_ID), getConfiguredLocalDirs uses LOCAL_DIRS environment variable for YARN-approved local directories.

                            In non-YARN mode (or for the driver in yarn-client mode), getConfiguredLocalDirs checks the following environment variables (in order) and returns the value of the first found:

                            1. SPARK_EXECUTOR_DIRS
                            2. SPARK_LOCAL_DIRS
                            3. MESOS_DIRECTORY (only when External Shuffle Service is not used)

                            The environment variables are a comma-separated list of local directory paths.

                            In the end, when no earlier environment variables were found, getConfiguredLocalDirs uses spark.local.dir configuration property (with java.io.tmpdir System property as the default value).
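
A simplified sketch of the non-YARN fallback order described above (the real implementation also handles YARN containers, the external shuffle service, and path separators):

import org.apache.spark.SparkConf\n\ndef configuredLocalDirs(conf: SparkConf): Array[String] =\n  sys.env.get(\"SPARK_EXECUTOR_DIRS\")\n    .orElse(sys.env.get(\"SPARK_LOCAL_DIRS\"))\n    .orElse(sys.env.get(\"MESOS_DIRECTORY\"))\n    .getOrElse(conf.get(\"spark.local.dir\", System.getProperty(\"java.io.tmpdir\")))\n    .split(\",\")\n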

                            getConfiguredLocalDirs is used when:

                            • DiskBlockManager is requested to createLocalDirs and createLocalDirsForMergedShuffleBlocks
                            • Utils utility is used to get a single random local root directory and create a spark directory in every local root directory
                            ","text":""},{"location":"Utils/#random-local-directory-path","title":"Random Local Directory Path
                            getLocalDir(\n  conf: SparkConf): String\n

getLocalDir takes a random directory path out of the configured local root directories.

                            getLocalDir throws an IOException if no local directory is defined:

                            Failed to get a temp directory under [[configuredLocalDirs]].\n

                            getLocalDir is used when:

                            • SparkEnv utility is used to create a base SparkEnv for the driver
                            • Utils utility is used to fetchFile
                            • DriverLogger is created
                            • RocksDBStateStoreProvider (Spark Structured Streaming) is requested for a RocksDB
                            • PythonBroadcast (PySpark) is requested to readObject
                            • AggregateInPandasExec (PySpark) is requested to doExecute
                            • EvalPythonExec (PySpark) is requested to doExecute
                            • WindowInPandasExec (PySpark) is requested to doExecute
                            • PythonForeachWriter (PySpark) is requested for a UnsafeRowBuffer
                            • Client (Spark on YARN) is requested to prepareLocalResources and createConfArchive
                            ","text":""},{"location":"Utils/#localrootdirs-registry","title":"localRootDirs Registry

                            Utils utility uses localRootDirs internal registry so getOrCreateLocalRootDirsImpl is executed just once (when first requested).

                            localRootDirs is available using getOrCreateLocalRootDirs method.

                            getOrCreateLocalRootDirs(\n  conf: SparkConf): Array[String]\n

                            getOrCreateLocalRootDirs is used when:

                            • Utils is used to getLocalDir
                            • Worker (Spark Standalone) is requested to launch an executor
                            ","text":""},{"location":"Utils/#creating-spark-directory-in-every-local-root-directory","title":"Creating spark Directory in Every Local Root Directory
                            getOrCreateLocalRootDirsImpl(\n  conf: SparkConf): Array[String]\n

                            getOrCreateLocalRootDirsImpl creates a spark-[randomUUID] directory under every root directory for local storage (and registers a shutdown hook to delete the directories at shutdown).

getOrCreateLocalRootDirsImpl prints out the following WARN message to the logs when any of the local root directories is specified as a URI (with a scheme):

                            The configured local directories are not expected to be URIs;\nhowever, got suspicious values [[uris]].\nPlease check your configured local directories.\n
                            ","text":""},{"location":"Utils/#local-uri-scheme","title":"Local URI Scheme

                            Utils defines a local URI scheme for files that are locally available on worker nodes in the cluster.

The local URI scheme is used when:

                            • Utils is used to isLocalUri
                            • Client (Spark on YARN) is used
                            ","text":""},{"location":"Utils/#islocaluri","title":"isLocalUri
                            isLocalUri(\n  uri: String): Boolean\n

                            isLocalUri is true when the URI is a local: URI (the given uri starts with local: scheme).

                            isLocalUri is used when:

                            • FIXME
                            ","text":""},{"location":"Utils/#getcurrentusername","title":"getCurrentUserName
                            getCurrentUserName(): String\n

getCurrentUserName computes the name of the user who started the SparkContext instance.

NOTE: It is later available as SparkContext.sparkUser.

Internally, it reads the SPARK_USER environment variable and, if not set, falls back to Hadoop Security API's UserGroupInformation.getCurrentUser().getShortUserName().
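
A minimal sketch of that fallback (assuming Hadoop is on the classpath, as it always is for Spark):

import org.apache.hadoop.security.UserGroupInformation\n\n// prefer SPARK_USER, otherwise ask the Hadoop Security API\nval sparkUser = sys.env.getOrElse(\n  \"SPARK_USER\",\n  UserGroupInformation.getCurrentUser().getShortUserName())\n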

                            NOTE: It is another place where Spark relies on Hadoop API for its operation.

                            ","text":""},{"location":"Utils/#localhostname","title":"localHostName
                            localHostName(): String\n

                            localHostName computes the local host name.

                            It starts by checking SPARK_LOCAL_HOSTNAME environment variable for the value. If it is not defined, it uses SPARK_LOCAL_IP to find the name (using InetAddress.getByName). If it is not defined either, it calls InetAddress.getLocalHost for the name.
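
A simplified sketch of that lookup order (the actual code also caches the result and handles IPv6 addresses):

import java.net.InetAddress\n\nval hostName = sys.env.get(\"SPARK_LOCAL_HOSTNAME\")\n  .orElse(sys.env.get(\"SPARK_LOCAL_IP\")\n    .map(ip => InetAddress.getByName(ip).getHostName))\n  .getOrElse(InetAddress.getLocalHost.getHostName)\n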

NOTE: Utils.localHostName is executed while SparkContext is created and also to compute the default value of the spark.driver.host configuration property.

                            ","text":""},{"location":"Utils/#getuserjars","title":"getUserJars
                            getUserJars(\n  conf: SparkConf): Seq[String]\n

                            getUserJars is the spark.jars configuration property with non-empty entries.

                            getUserJars is used when:

                            • SparkContext is created
                            ","text":""},{"location":"Utils/#extracthostportfromsparkurl","title":"extractHostPortFromSparkUrl
                            extractHostPortFromSparkUrl(\n  sparkUrl: String): (String, Int)\n

                            extractHostPortFromSparkUrl creates a Java URI with the input sparkUrl and takes the host and port parts.

                            extractHostPortFromSparkUrl asserts that the input sparkURL uses spark scheme.

                            extractHostPortFromSparkUrl throws a SparkException for unparseable spark URLs:

                            Invalid master URL: [sparkUrl]\n
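
A minimal sketch of the parsing (assuming a well-formed URL such as spark://host:7077; the real code also rejects URLs with paths or query parts):

import java.net.URI\nimport org.apache.spark.SparkException\n\ndef hostPort(sparkUrl: String): (String, Int) = {\n  val uri = new URI(sparkUrl)\n  if (uri.getScheme != \"spark\" || uri.getHost == null || uri.getPort < 0) {\n    throw new SparkException(s\"Invalid master URL: $sparkUrl\")\n  }\n  (uri.getHost, uri.getPort)\n}\n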

                            extractHostPortFromSparkUrl is used when:

                            • StandaloneSubmitRequestServlet is requested to buildDriverDescription
                            • RpcAddress is requested to extract an RpcAddress from a Spark master URL
                            ","text":""},{"location":"Utils/#isDynamicAllocationEnabled","title":"isDynamicAllocationEnabled
                            isDynamicAllocationEnabled(\n  conf: SparkConf): Boolean\n

                            isDynamicAllocationEnabled checks whether Dynamic Allocation of Executors is enabled (true) or not (false).

                            isDynamicAllocationEnabled is positive (true) when all the following hold:

                            1. spark.dynamicAllocation.enabled configuration property is true
                            2. spark.master is non-local
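
A simplified sketch of those two conditions (the real implementation also honours an internal testing flag):

import org.apache.spark.SparkConf\n\ndef dynamicAllocationEnabled(conf: SparkConf): Boolean =\n  conf.getBoolean(\"spark.dynamicAllocation.enabled\", false) &&\n    !conf.get(\"spark.master\", \"\").startsWith(\"local\")\n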

                            isDynamicAllocationEnabled is used when:

                            • SparkContext is created (to start an ExecutorAllocationManager)
                            • TaskResourceProfile is requested for custom executor resources
                            • ResourceProfileManager is created
                            • DAGScheduler is requested to checkBarrierStageWithDynamicAllocation
                            • TaskSchedulerImpl is requested to resourceOffers
                            • SchedulerBackendUtils is requested to getInitialTargetExecutorNumber
                            • StandaloneSchedulerBackend (Spark Standalone) is requested to start (for reporting purposes)
                            • ExecutorPodsAllocator (Spark on Kubernetes) is created (maxPVCs)
                            • ApplicationMaster (Spark on YARN) is created (maxNumExecutorFailures)
                            • YarnSchedulerBackend (Spark on YARN) is requested to getShufflePushMergerLocations
                            ","text":""},{"location":"Utils/#checkandgetk8smasterurl","title":"checkAndGetK8sMasterUrl
                            checkAndGetK8sMasterUrl(\n  rawMasterURL: String): String\n

                            checkAndGetK8sMasterUrl...FIXME

                            checkAndGetK8sMasterUrl is used when:

                            • SparkSubmit is requested to prepareSubmitEnvironment (for Kubernetes cluster manager)
                            ","text":""},{"location":"Utils/#fetching-file","title":"Fetching File
                            fetchFile(\n  url: String,\n  targetDir: File,\n  conf: SparkConf,\n  securityMgr: SecurityManager,\n  hadoopConf: Configuration,\n  timestamp: Long,\n  useCache: Boolean): File\n

                            fetchFile...FIXME

                            fetchFile is used when:

• SparkContext is requested to addFile

• Executor is requested to updateDependencies

                            • Spark Standalone's DriverRunner is requested to downloadUserJar

                            ","text":""},{"location":"Utils/#ispushbasedshuffleenabled","title":"isPushBasedShuffleEnabled
                            isPushBasedShuffleEnabled(\n  conf: SparkConf,\n  isDriver: Boolean,\n  checkSerializer: Boolean = true): Boolean\n

                            isPushBasedShuffleEnabled takes the value of spark.shuffle.push.enabled configuration property (from the given SparkConf).

                            If false, isPushBasedShuffleEnabled does nothing and returns false as well.

                            Otherwise, isPushBasedShuffleEnabled returns whether it is even possible to use push-based shuffle or not based on the following:

                            1. External Shuffle Service is used (based on spark.shuffle.service.enabled that should be true)
                            2. spark.master is yarn
                            3. (only with checkSerializer enabled) spark.serializer is a Serializer that supportsRelocationOfSerializedObjects
                            4. spark.io.encryption.enabled is false
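
A hedged sketch of the requirements above (omitting the serializer check and the WARN logging):

import org.apache.spark.SparkConf\n\ndef pushBasedShufflePossible(conf: SparkConf): Boolean =\n  conf.getBoolean(\"spark.shuffle.push.enabled\", false) &&\n    conf.getBoolean(\"spark.shuffle.service.enabled\", false) &&\n    conf.get(\"spark.master\", \"\").startsWith(\"yarn\") &&\n    !conf.getBoolean(\"spark.io.encryption.enabled\", false)\n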

                            In case spark.shuffle.push.enabled configuration property is enabled but the above requirements did not hold, isPushBasedShuffleEnabled prints out the following WARN message to the logs:

                            Push-based shuffle can only be enabled\nwhen the application is submitted to run in YARN mode,\nwith external shuffle service enabled, IO encryption disabled,\nand relocation of serialized objects supported.\n

                            isPushBasedShuffleEnabled\u00a0is used when:

                            • ShuffleDependency is requested to canShuffleMergeBeEnabled
                            • MapOutputTrackerMaster is created
                            • MapOutputTrackerWorker is created
                            • DAGScheduler is created
                            • ShuffleBlockPusher utility is used to create a BLOCK_PUSHER_POOL thread pool
                            • BlockManager is requested to initialize and registerWithExternalShuffleServer
                            • BlockManagerMasterEndpoint is created
                            • DiskBlockManager is requested to createLocalDirsForMergedShuffleBlocks
                            ","text":""},{"location":"Utils/#logging","title":"Logging

                            Enable ALL logging level for org.apache.spark.util.Utils logger to see what happens inside.

                            Add the following line to conf/log4j.properties:

                            log4j.logger.org.apache.spark.util.Utils=ALL\n

                            Refer to Logging.

                            ","text":""},{"location":"architecture/","title":"Architecture","text":"


Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run.

Spark architecture

The driver and the executors run in their own Java processes. You can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.

Spark architecture in detail

                            Physical machines are called hosts or nodes.

                            "},{"location":"configuration-properties/","title":"Configuration Properties","text":""},{"location":"configuration-properties/#sparkappid","title":"spark.app.id

Unique identifier of a Spark application that Spark uses to identify metric sources.

                            Default: TaskScheduler.applicationId()

                            Set when SparkContext is created

                            ","text":""},{"location":"configuration-properties/#sparkbroadcastblocksize","title":"spark.broadcast.blockSize

                            The size of each piece of a block (in kB unless the unit is specified)

                            Default: 4m

                            Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, BlockManager might take a performance hit

                            Used when:

                            • TorrentBroadcast is requested to setConf
                            ","text":""},{"location":"configuration-properties/#sparkbroadcastcompress","title":"spark.broadcast.compress

                            Controls broadcast variable compression (before sending them over the wire)

                            Default: true

                            Generally a good idea. Compression will use spark.io.compression.codec

                            Used when:

                            • TorrentBroadcast is requested to setConf
                            • SerializerManager is created
                            ","text":""},{"location":"configuration-properties/#spark.buffer.pageSize","title":"spark.buffer.pageSize

                            spark.buffer.pageSize

                            The amount of memory used per page (in bytes)

                            Default: (undefined)

                            Used when:

                            • MemoryManager is created
                            ","text":""},{"location":"configuration-properties/#sparkcleanerreferencetracking","title":"spark.cleaner.referenceTracking

                            Controls whether to enable ContextCleaner

                            Default: true

                            ","text":""},{"location":"configuration-properties/#sparkdiskstoresubdirectories","title":"spark.diskStore.subDirectories

                            Number of subdirectories inside each path listed in spark.local.dir for hashing block files into.

                            Default: 64

                            Used by BlockManager and DiskBlockManager

                            ","text":""},{"location":"configuration-properties/#sparkdriverhost","title":"spark.driver.host

                            Address of the driver (endpoints)

                            Default: Utils.localCanonicalHostName

                            ","text":""},{"location":"configuration-properties/#sparkdriverlogallowerasurecoding","title":"spark.driver.log.allowErasureCoding

                            Default: false

                            Used when:

                            • DfsAsyncWriter is requested to init
                            ","text":""},{"location":"configuration-properties/#sparkdriverlogdfsdir","title":"spark.driver.log.dfsDir

                            The directory on a Hadoop DFS-compliant file system where DriverLogger copies driver logs to

                            Default: (undefined)

                            Used when:

                            • FsHistoryProvider is requested to startPolling (and cleanDriverLogs)
                            • DfsAsyncWriter is requested to init
                            • DriverLogger utility is used to create a DriverLogger (for a SparkContext)
                            ","text":""},{"location":"configuration-properties/#sparkdriverlogpersisttodfsenabled","title":"spark.driver.log.persistToDfs.enabled

                            Enables DriverLogger

                            Default: false

                            Used when:

                            • DriverLogger utility is used to create a DriverLogger (for a SparkContext)
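
                             A minimal sketch of enabling driver log persistence with these two properties; the DFS directory is an illustrative assumption, not a required location:

                             import org.apache.spark.SparkConf\n\n// enable DriverLogger and point it at a DFS-compliant directory (illustrative path)\nval conf = new SparkConf()\n  .set(\"spark.driver.log.persistToDfs.enabled\", \"true\")\n  .set(\"spark.driver.log.dfsDir\", \"hdfs:///driver-logs\")\n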
                            ","text":""},{"location":"configuration-properties/#sparkdrivermaxresultsize","title":"spark.driver.maxResultSize

                            Maximum size of task results (in bytes)

                            Default: 1g

                            Used when:

                            • TaskRunner is requested to run a task (and decide on the type of a serialized task result)

                            • TaskSetManager is requested to check available memory for task results
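
                             For illustration, a sketch of raising this limit for jobs that collect large results (the 2g value is an arbitrary example):

                             import org.apache.spark.SparkConf\n\n// allow up to 2 GB of serialized task results per job; larger results abort the job\nval conf = new SparkConf().set(\"spark.driver.maxResultSize\", \"2g\")\n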

                            ","text":""},{"location":"configuration-properties/#sparkdriverport","title":"spark.driver.port

                            Port of the driver (endpoints)

                            Default: 0

                            ","text":""},{"location":"configuration-properties/#sparkexecutorcores","title":"spark.executor.cores

                            Number of CPU cores for Executor

                            Default: 1

                            ","text":""},{"location":"configuration-properties/#sparkexecutorheartbeatmaxfailures","title":"spark.executor.heartbeat.maxFailures

                            Number of times an Executor tries sending heartbeats to the driver before it gives up and exits (with exit code 56).

                            Default: 60

                             For example, with the default of 60 maximum failures and spark.executor.heartbeatInterval of 10s, an Executor keeps trying to send heartbeats for up to 600s (10 minutes) before giving up.

                            Used when:

                            • Executor is created (and reportHeartBeat)
                            ","text":""},{"location":"configuration-properties/#sparkexecutorheartbeatinterval","title":"spark.executor.heartbeatInterval

                            Interval between Executor heartbeats (to the driver)

                            Default: 10s

                            Used when:

                            • SparkContext is created
                            • Executor is created and requested to reportHeartBeat
                            • HeartbeatReceiver is created
                            ","text":""},{"location":"configuration-properties/#sparkexecutorid","title":"spark.executor.id

                            Default: (undefined)

                            ","text":""},{"location":"configuration-properties/#sparkexecutorinstances","title":"spark.executor.instances

                            Number of executors to use

                            Default: (undefined)

                            ","text":""},{"location":"configuration-properties/#sparkexecutormemory","title":"spark.executor.memory

                            Amount of memory to use for an Executor

                            Default: 1g

                            Equivalent to SPARK_EXECUTOR_MEMORY environment variable.

                            ","text":""},{"location":"configuration-properties/#sparkexecutormemoryoverhead","title":"spark.executor.memoryOverhead

                            The amount of non-heap memory (in MiB) to be allocated per executor

                            Used when:

                            • ResourceProfile is requested for the default executor resources
                            • Client (Spark on YARN) is created
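
                             A sketch of sizing executors with the properties above; the concrete values are illustrative assumptions, not recommendations:

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.executor.instances\", \"4\")        // number of executors\n  .set(\"spark.executor.cores\", \"2\")            // CPU cores per executor\n  .set(\"spark.executor.memory\", \"4g\")          // JVM heap per executor\n  .set(\"spark.executor.memoryOverhead\", \"512\") // extra non-heap memory (in MiB)\n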
                            ","text":""},{"location":"configuration-properties/#sparkexecutormetricsfilesystemschemes","title":"spark.executor.metrics.fileSystemSchemes

                            A comma-separated list of the file system schemes to report in executor metrics

                            Default: file,hdfs

                            ","text":""},{"location":"configuration-properties/#sparkexecutormetricspollinginterval","title":"spark.executor.metrics.pollingInterval

                            How often to collect executor metrics (in ms):

                            • 0 - the polling is done on executor heartbeats
                            • A positive number - the polling is done at this interval

                            Default: 0

                            Used when:

                            • Executor is created
                            ","text":""},{"location":"configuration-properties/#sparkexecutoruserclasspathfirst","title":"spark.executor.userClassPathFirst

                            Controls whether to load classes in user-defined jars before those in Spark jars

                            Default: false

                            Used when:

                            • CoarseGrainedExecutorBackend is requested to create a ClassLoader
                            • Executor is created
                            • Client utility (Spark on YARN) is used to isUserClassPathFirst
                            ","text":""},{"location":"configuration-properties/#sparkextralisteners","title":"spark.extraListeners

                            A comma-separated list of fully-qualified class names of SparkListeners (to be registered when SparkContext is created)

                            Default: (empty)
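
                             A minimal sketch of a custom SparkListener registered via spark.extraListeners; JobEndLogger is a hypothetical class that needs a no-argument constructor, must be on the application classpath, and is normally referenced by its fully-qualified name:

                             import org.apache.spark.SparkConf\nimport org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}\n\n// logs every job completion; instantiated by SparkContext at startup\nclass JobEndLogger extends SparkListener {\n  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =\n    println(s\"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}\")\n}\n\nval conf = new SparkConf().set(\"spark.extraListeners\", \"JobEndLogger\")\n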

                            ","text":""},{"location":"configuration-properties/#sparkfiletransferto","title":"spark.file.transferTo

                            Controls whether to use Java FileChannels (Java NIO) for copying data between two Java FileInputStreams to improve copy performance

                            Default: true

                            Used when:

                            • BypassMergeSortShuffleWriter and UnsafeShuffleWriter are created
                            ","text":""},{"location":"configuration-properties/#sparkfiles","title":"spark.files

                            The files to be added to a Spark application (that can be defined directly as a configuration property or indirectly using --files option of spark-submit script)

                            Default: (empty)

                            Used when:

                            • SparkContext is created
                            ","text":""},{"location":"configuration-properties/#sparkioencryptionenabled","title":"spark.io.encryption.enabled

                            Controls local disk I/O encryption

                            Default: false

                            Used when:

                            • SparkEnv utility is used to create a SparkEnv for the driver (to create a IO encryption key)
                            • BlockStoreShuffleReader is requested to read combined records (and fetchContinuousBlocksInBatch)
                            ","text":""},{"location":"configuration-properties/#sparkjars","title":"spark.jars

                            Default: (empty)

                            ","text":""},{"location":"configuration-properties/#sparkkryopool","title":"spark.kryo.pool

                            Default: true

                            Used when:

                            • KryoSerializer is created
                            ","text":""},{"location":"configuration-properties/#sparkkryounsafe","title":"spark.kryo.unsafe

                            Whether KryoSerializer should use Unsafe-based IO for serialization

                            Default: false

                            ","text":""},{"location":"configuration-properties/#sparklocaldir","title":"spark.local.dir

                            A comma-separated list of directory paths for \"scratch\" space (a temporary storage for map output files, RDDs that get stored on disk, etc.). It is recommended to use paths on fast local disks in your system (e.g. SSDs).

                            Default: java.io.tmpdir System property

                            ","text":""},{"location":"configuration-properties/#sparklocalitywait","title":"spark.locality.wait

                            How long to wait until an executor is available for locality-aware delay scheduling (for PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL TaskLocalities) unless locality-specific setting is set (i.e., spark.locality.wait.process, spark.locality.wait.node, and spark.locality.wait.rack, respectively)

                            Default: 3s
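
                             A sketch of tuning delay scheduling with this property and its per-level overrides (the values are arbitrary examples):

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.locality.wait\", \"1s\")         // base wait for all locality levels\n  .set(\"spark.locality.wait.node\", \"500ms\") // override for NODE_LOCAL only\n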

                            ","text":""},{"location":"configuration-properties/#sparklocalitywaitlegacyresetontasklaunch","title":"spark.locality.wait.legacyResetOnTaskLaunch

                            (internal) Whether to use the legacy behavior of locality wait, which resets the delay timer anytime a task is scheduled.

                            Default: false

                            Used when:

                            • TaskSchedulerImpl is created
                            • TaskSetManager is created
                            ","text":""},{"location":"configuration-properties/#sparklocalitywaitnode","title":"spark.locality.wait.node

                            Scheduling delay for TaskLocality.NODE_LOCAL

                            Default: spark.locality.wait

                            Used when:

                            • TaskSetManager is requested for the locality wait (of TaskLocality.NODE_LOCAL)
                            ","text":""},{"location":"configuration-properties/#sparklocalitywaitprocess","title":"spark.locality.wait.process

                            Scheduling delay for TaskLocality.PROCESS_LOCAL

                            Default: spark.locality.wait

                            Used when:

                            • TaskSetManager is requested for the locality wait (of TaskLocality.PROCESS_LOCAL)
                            ","text":""},{"location":"configuration-properties/#sparklocalitywaitrack","title":"spark.locality.wait.rack

                            Scheduling delay for TaskLocality.RACK_LOCAL

                            Default: spark.locality.wait

                            Used when:

                            • TaskSetManager is requested for the locality wait (of TaskLocality.RACK_LOCAL)
                            ","text":""},{"location":"configuration-properties/#sparklogconf","title":"spark.logConf

                            Default: false

                            ","text":""},{"location":"configuration-properties/#sparkloglineage","title":"spark.logLineage

                            Enables printing out the RDD lineage graph (using RDD.toDebugString) when executing an action (and running a job)

                            Default: false
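
                             The same lineage graph can also be printed on demand with RDD.toDebugString, e.g. in spark-shell (where sc is the SparkContext):

                             val rdd = sc.parallelize(0 to 9).map(_ * 2).filter(_ > 4)\n\n// prints the lineage that spark.logLineage would log when an action runs\nprintln(rdd.toDebugString)\n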

                            ","text":""},{"location":"configuration-properties/#sparkmaster","title":"spark.master

                            Master URL of the cluster manager to connect the Spark application to

                            ","text":""},{"location":"configuration-properties/#sparkmemoryfraction","title":"spark.memory.fraction

                            Fraction of JVM heap space used for execution and storage.

                            Default: 0.6

                             The lower the value, the more frequent the spills and cached data eviction. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended.

                            ","text":""},{"location":"configuration-properties/#sparkmemoryoffheapenabled","title":"spark.memory.offHeap.enabled

                            Controls whether Tungsten memory will be allocated on the JVM heap (false) or off-heap (true / using sun.misc.Unsafe).

                            Default: false

                            When enabled, spark.memory.offHeap.size must be greater than 0.

                            Used when:

                            • MemoryManager is requested for tungstenMemoryMode
                            ","text":""},{"location":"configuration-properties/#sparkmemoryoffheapsize","title":"spark.memory.offHeap.size

                            Maximum memory (in bytes) for off-heap memory allocation

                            Default: 0

                            This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit then be sure to shrink your JVM heap size accordingly.

                            Must not be negative and be set to a positive value when spark.memory.offHeap.enabled is enabled
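
                             A hedged sketch of enabling off-heap Tungsten memory with both properties (the 1g size is an arbitrary example):

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.memory.offHeap.enabled\", \"true\")\n  .set(\"spark.memory.offHeap.size\", \"1g\") // must be positive when off-heap is enabled\n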

                            ","text":""},{"location":"configuration-properties/#sparkmemorystoragefraction","title":"spark.memory.storageFraction

                            Amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction.

                            Default: 0.5

                             The higher the value, the less working memory may be available to execution, and tasks may spill to disk more often. The default value is recommended.

                            Must be in [0,1)

                            Used when:

                            • UnifiedMemoryManager is created
                            • MemoryManager is created
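
                             A back-of-the-envelope sketch of how the two fractions split an executor heap, assuming a 4 GB heap and the roughly 300 MB reserved internally by the unified memory manager (an internal detail that may change):

                             // illustrative arithmetic only; the real sizing is done by UnifiedMemoryManager\nval heap = 4L * 1024 * 1024 * 1024              // 4 GB executor heap\nval reserved = 300L * 1024 * 1024               // reserved memory\nval unified = ((heap - reserved) * 0.6).toLong  // spark.memory.fraction\nval storage = (unified * 0.5).toLong            // spark.memory.storageFraction\nprintln(s\"unified=$unified bytes, storage=$storage bytes\")\n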
                            ","text":""},{"location":"configuration-properties/#sparknetworkiopreferdirectbufs","title":"spark.network.io.preferDirectBufs

                            Default: true

                            ","text":""},{"location":"configuration-properties/#sparknetworkmaxremoteblocksizefetchtomem","title":"spark.network.maxRemoteBlockSizeFetchToMem

                             Remote blocks are fetched to disk when the size of a block is above this threshold (in bytes)

                             This avoids a giant request taking too much memory. Note that this configuration affects both shuffle fetches and block manager remote block fetches.

                             When using an external shuffle service, it must be version 2.3.0 or newer

                            Default: 200m

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                            • NettyBlockTransferService is requested to uploadBlock
                            • BlockManager is requested to fetchRemoteManagedBuffer
                            ","text":""},{"location":"configuration-properties/#sparknetworksharedbytebufallocatorsenabled","title":"spark.network.sharedByteBufAllocators.enabled

                            Default: true

                            ","text":""},{"location":"configuration-properties/#sparknetworktimeout","title":"spark.network.timeout

                            Network timeout (in seconds) to use for RPC remote endpoint lookup

                            Default: 120s

                            ","text":""},{"location":"configuration-properties/#sparknetworktimeoutinterval","title":"spark.network.timeoutInterval

                            (in millis)

                            Default: spark.storage.blockManagerTimeoutIntervalMs

                            ","text":""},{"location":"configuration-properties/#sparkrddcompress","title":"spark.rdd.compress

                            Controls whether to compress RDD partitions when stored serialized

                            Default: false

                            ","text":""},{"location":"configuration-properties/#sparkreducermaxblocksinflightperaddress","title":"spark.reducer.maxBlocksInFlightPerAddress

                            Maximum number of remote blocks being fetched per reduce task from a given host port

                             When a large number of blocks is requested from a given address in a single fetch or simultaneously, the serving executor or a Node Manager could crash. This is especially useful to reduce the load on the Node Manager when the external shuffle service is enabled. You can mitigate the issue by setting this property to a lower value.

                            Default: (unlimited)

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                            ","text":""},{"location":"configuration-properties/#sparkreducermaxreqsinflight","title":"spark.reducer.maxReqsInFlight

                            Maximum number of remote requests to fetch blocks at any given point

                             As the number of hosts in the cluster increases, it might lead to a very large number of inbound connections to one or more nodes, causing the workers to fail under load. Limiting the number of in-flight fetch requests mitigates this scenario.

                            Default: (unlimited)

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                            ","text":""},{"location":"configuration-properties/#sparkreducermaxsizeinflight","title":"spark.reducer.maxSizeInFlight

                            Maximum size of all map outputs to fetch simultaneously from each reduce task (in MiB unless otherwise specified)

                            Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory

                            Default: 48m

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
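
                             A sketch of tuning the shuffle-read limits described above (the values are illustrative assumptions):

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.reducer.maxSizeInFlight\", \"96m\")             // more fetch buffer per reduce task\n  .set(\"spark.reducer.maxReqsInFlight\", \"64\")              // cap concurrent fetch requests\n  .set(\"spark.reducer.maxBlocksInFlightPerAddress\", \"128\") // cap blocks per remote host\n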
                            ","text":""},{"location":"configuration-properties/#sparkreplclassuri","title":"spark.repl.class.uri

                             (internal) The URI from which the Spark REPL (spark-shell) serves the classes compiled in the REPL so that executors can load them

                             Default: (undefined)

                            ","text":""},{"location":"configuration-properties/#sparkrpclookuptimeout","title":"spark.rpc.lookupTimeout

                            Default Endpoint Lookup Timeout

                            Default: 120s

                            ","text":""},{"location":"configuration-properties/#sparkrpcmessagemaxsize","title":"spark.rpc.message.maxSize

                            Maximum allowed message size for RPC communication (in MB unless specified)

                            Default: 128

                            Must be below 2047MB (Int.MaxValue / 1024 / 1024)

                            Used when:

                            • CoarseGrainedSchedulerBackend is requested to launch tasks
                            • RpcUtils is requested for the maximum message size
                              • Executor is created
                              • MapOutputTrackerMaster is created (and makes sure that spark.shuffle.mapOutput.minSizeForBroadcast is below the threshold)
                            ","text":""},{"location":"configuration-properties/#sparkscheduler","title":"spark.scheduler","text":""},{"location":"configuration-properties/#spark.scheduler.barrier.maxConcurrentTasksCheck.interval","title":"barrier.maxConcurrentTasksCheck.interval","text":"

                            spark.scheduler.barrier.maxConcurrentTasksCheck.interval

                            "},{"location":"configuration-properties/#spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures","title":"barrier.maxConcurrentTasksCheck.maxFailures","text":"

                            spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures

                            "},{"location":"configuration-properties/#spark.scheduler.minRegisteredResourcesRatio","title":"minRegisteredResourcesRatio","text":"

                            spark.scheduler.minRegisteredResourcesRatio

                            Minimum ratio of (registered resources / total expected resources) before submitting tasks

                            Default: (undefined)

                            "},{"location":"configuration-properties/#spark.scheduler.revive.interval","title":"spark.scheduler.revive.interval

                            spark.scheduler.revive.interval

                             The time (in millis) between revives of resource offers

                            Default: 1s

                            Used when:

                            • DriverEndpoint is requested to onStart
                            ","text":""},{"location":"configuration-properties/#sparkserializer","title":"spark.serializer

                            The fully-qualified class name of the Serializer (of the driver and executors)

                            Default: org.apache.spark.serializer.JavaSerializer

                            Used when:

                            • SparkEnv utility is used to create a SparkEnv
                            • SparkConf is requested to registerKryoClasses (as a side-effect)
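
                             A minimal sketch of switching to KryoSerializer and registering classes (MyRecord is a hypothetical user class):

                             import org.apache.spark.SparkConf\n\ncase class MyRecord(id: Long, name: String)\n\nval conf = new SparkConf()\n  .set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n  .registerKryoClasses(Array(classOf[MyRecord])) // also sets spark.kryo.classesToRegister\n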
                            ","text":""},{"location":"configuration-properties/#sparkshuffle","title":"spark.shuffle","text":""},{"location":"configuration-properties/#spark.shuffle.sort.io.plugin.class","title":"sort.io.plugin.class

                            spark.shuffle.sort.io.plugin.class

                            Name of the class to use for shuffle IO

                            Default: LocalDiskShuffleDataIO

                            Used when:

                            • ShuffleDataIOUtils is requested to loadShuffleDataIO
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.checksum.enabled","title":"checksum.enabled

                            spark.shuffle.checksum.enabled

                             Controls checksumming of shuffle data. If enabled, Spark calculates the checksum values for each partition's data within the map output file and stores the values in a checksum file on disk. When shuffle data corruption is detected, Spark tries to diagnose its cause (e.g., a network or disk issue) using the checksum file.

                            Default: true

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.compress","title":"compress

                            spark.shuffle.compress

                            Enables compressing shuffle output when stored

                            Default: true

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.detectCorrupt","title":"detectCorrupt

                            spark.shuffle.detectCorrupt

                            Controls corruption detection in fetched blocks

                            Default: true

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.detectCorrupt.useExtraMemory","title":"detectCorrupt.useExtraMemory

                            spark.shuffle.detectCorrupt.useExtraMemory

                             If enabled, part of a compressed/encrypted stream is decompressed/decrypted using extra memory to detect corruption early. Any IOException thrown causes the task to be retried once; if it fails again with the same exception, a FetchFailedException is thrown to retry the previous stage

                            Default: false

                            Used when:

                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.file.buffer","title":"file.buffer

                            spark.shuffle.file.buffer

                            Size of the in-memory buffer for each shuffle file output stream, in KiB unless otherwise specified. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.

                            Default: 32k

                            Must be greater than 0 and less than or equal to 2097151 ((Integer.MAX_VALUE - 15) / 1024)

                            Used when the following are created:

                            • BypassMergeSortShuffleWriter
                            • ShuffleExternalSorter
                            • UnsafeShuffleWriter
                            • ExternalAppendOnlyMap
                            • ExternalSorter
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.manager","title":"manager

                            spark.shuffle.manager

                            A fully-qualified class name or the alias of the ShuffleManager in a Spark application

                            Default: sort

                            Supported aliases:

                            • sort
                            • tungsten-sort

                            Used when SparkEnv object is requested to create a \"base\" SparkEnv for a driver or an executor

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.mapOutput.parallelAggregationThreshold","title":"mapOutput.parallelAggregationThreshold

                            spark.shuffle.mapOutput.parallelAggregationThreshold

                            (internal) Multi-thread is used when the number of mappers * shuffle partitions is greater than or equal to this threshold. Note that the actual parallelism is calculated by number of mappers * shuffle partitions / this threshold + 1, so this threshold should be positive.

                            Default: 10000000

                            Used when:

                            • MapOutputTrackerMaster is requested for the statistics of a ShuffleDependency
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.minNumPartitionsToHighlyCompress","title":"minNumPartitionsToHighlyCompress

                            spark.shuffle.minNumPartitionsToHighlyCompress

                            (internal) Minimum number of partitions (threshold) for MapStatus utility to prefer a HighlyCompressedMapStatus (over CompressedMapStatus) (for ShuffleWriters).

                            Default: 2000

                            Must be a positive integer (above 0)

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.push.enabled","title":"push.enabled

                            spark.shuffle.push.enabled

                            Enables push-based shuffle on the client side

                            Default: false

                            Works in conjunction with the server side flag spark.shuffle.push.server.mergedShuffleFileManagerImpl which needs to be set with the appropriate org.apache.spark.network.shuffle.MergedShuffleFileManager implementation for push-based shuffle to be enabled

                            Used when:

                            • Utils utility is used to determine whether push-based shuffle is enabled or not
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.readHostLocalDisk","title":"readHostLocalDisk

                            spark.shuffle.readHostLocalDisk

                             If enabled (with spark.shuffle.useOldFetchProtocol disabled and spark.shuffle.service.enabled enabled), shuffle blocks requested from block managers running on the same host are read directly from disk instead of being fetched as remote blocks over the network.

                            Default: true

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.registration.maxAttempts","title":"registration.maxAttempts

                            spark.shuffle.registration.maxAttempts

                            How many attempts to register a BlockManager with External Shuffle Service

                            Default: 3

                            Used when BlockManager is requested to register with External Shuffle Server

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.sort.bypassMergeThreshold","title":"sort.bypassMergeThreshold

                            spark.shuffle.sort.bypassMergeThreshold

                             Maximum number of reduce partitions below which SortShuffleManager avoids merge-sorting data when there is no map-side aggregation

                            Default: 200

                            Used when:

                            • SortShuffleWriter utility is used to shouldBypassMergeSort
                            • ShuffleExchangeExec (Spark SQL) physical operator is requested to prepareShuffleDependency
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.spill.initialMemoryThreshold","title":"spill.initialMemoryThreshold

                            spark.shuffle.spill.initialMemoryThreshold

                            Initial threshold for the size of an in-memory collection

                            Default: 5MB

                            Used by Spillable

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.spill.numElementsForceSpillThreshold","title":"spill.numElementsForceSpillThreshold

                            spark.shuffle.spill.numElementsForceSpillThreshold

                            (internal) The maximum number of elements in memory before forcing the shuffle sorter to spill.

                            Default: Integer.MAX_VALUE

                            The default value is to never force the sorter to spill, until Spark reaches some limitations, like the max page size limitation for the pointer array in the sorter.

                            Used when:

                            • ShuffleExternalSorter is created
                            • Spillable is created
                            • Spark SQL's SortBasedAggregator is requested for an UnsafeKVExternalSorter
                            • Spark SQL's ObjectAggregationMap is requested to dumpToExternalSorter
                            • Spark SQL's UnsafeExternalRowSorter is created
                            • Spark SQL's UnsafeFixedWidthAggregationMap is requested for an UnsafeKVExternalSorter
                            ","text":""},{"location":"configuration-properties/#spark.shuffle.sync","title":"sync

                            spark.shuffle.sync

                            Controls whether DiskBlockObjectWriter should force outstanding writes to disk while committing a single atomic block (i.e. all operating system buffers should synchronize with the disk to ensure that all changes to a file are in fact recorded in the storage)

                            Default: false

                            Used when BlockManager is requested for a DiskBlockObjectWriter

                            ","text":""},{"location":"configuration-properties/#spark.shuffle.useOldFetchProtocol","title":"useOldFetchProtocol

                            spark.shuffle.useOldFetchProtocol

                             Whether to use the old protocol while doing shuffle block fetching. Enable it only for compatibility when jobs on a newer Spark version fetch shuffle blocks from an external shuffle service of an older version.

                            Default: false

                            ","text":""},{"location":"configuration-properties/#sparkspeculation","title":"spark.speculation

                            Controls Speculative Execution of Tasks

                            Default: false

                            ","text":""},{"location":"configuration-properties/#sparkspeculationinterval","title":"spark.speculation.interval

                            The time interval to use before checking for speculative tasks in Speculative Execution of Tasks.

                            Default: 100ms

                            ","text":""},{"location":"configuration-properties/#sparkspeculationmultiplier","title":"spark.speculation.multiplier

                             How many times slower a task must be than the median task duration to be considered for speculation

                             Default: 1.5

                            ","text":""},{"location":"configuration-properties/#sparkspeculationquantile","title":"spark.speculation.quantile

                             The fraction of tasks that must be complete before speculation is enabled for a stage in Speculative Execution of Tasks.

                            Default: 0.75
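
                             A sketch of enabling speculative execution together with the related properties (the values are illustrative, not recommendations):

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.speculation\", \"true\")\n  .set(\"spark.speculation.interval\", \"200ms\") // how often to check for stragglers\n  .set(\"spark.speculation.multiplier\", \"2\")   // a task must be 2x slower than the median\n  .set(\"spark.speculation.quantile\", \"0.9\")   // only after 90% of tasks have finished\n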

                            ","text":""},{"location":"configuration-properties/#sparkstorageblockmanagerslavetimeoutms","title":"spark.storage.blockManagerSlaveTimeoutMs

                            (in millis)

                            Default: spark.network.timeout

                            ","text":""},{"location":"configuration-properties/#sparkstorageblockmanagertimeoutintervalms","title":"spark.storage.blockManagerTimeoutIntervalMs

                            (in millis)

                            Default: 60s

                            ","text":""},{"location":"configuration-properties/#sparkstoragelocaldiskbyexecutorscachesize","title":"spark.storage.localDiskByExecutors.cacheSize

                             The maximum number of executors for which the local directories are stored. This size is applied to both the driver and the executors to avoid an unbounded store. The cache is used to avoid going over the network when fetching disk-persisted RDD blocks or shuffle blocks (when spark.shuffle.readHostLocalDisk is set) from the same host.

                            Default: 1000

                            ","text":""},{"location":"configuration-properties/#sparkstoragereplicationpolicy","title":"spark.storage.replication.policy

                            Default: RandomBlockReplicationPolicy

                            ","text":""},{"location":"configuration-properties/#sparkstorageunrollmemorythreshold","title":"spark.storage.unrollMemoryThreshold

                            Initial memory threshold (in bytes) to unroll (materialize) a block to store in memory

                            Default: 1024 * 1024

                            Must be at most the total amount of memory available for storage

                            Used when:

                            • MemoryStore is created
                            ","text":""},{"location":"configuration-properties/#sparksubmitdeploymode","title":"spark.submit.deployMode
                            • client (default)
                            • cluster
                            ","text":""},{"location":"configuration-properties/#sparktaskcpus","title":"spark.task.cpus

                            The number of CPU cores to schedule (allocate) to a task

                            Default: 1

                            Used when:

                            • ExecutorAllocationManager is created
                            • TaskSchedulerImpl is created
                            • AppStatusListener is requested to handle a SparkListenerEnvironmentUpdate event
                            • SparkContext utility is used to create a TaskScheduler
                            • ResourceProfile is requested to getDefaultTaskResources
                            • LocalityPreferredContainerPlacementStrategy is requested to numExecutorsPending
                            ","text":""},{"location":"configuration-properties/#sparktaskmaxdirectresultsize","title":"spark.task.maxDirectResultSize

                            Maximum size of a task result (in bytes) to be sent to the driver as a DirectTaskResult

                            Default: 1048576B (1L << 20)

                            Used when:

                            • TaskRunner is requested to run a task (and decide on the type of a serialized task result)
                            ","text":""},{"location":"configuration-properties/#sparktaskmaxfailures","title":"spark.task.maxFailures

                            Number of failures of a single task (of a TaskSet) before giving up on the entire TaskSet and then the job

                            Default: 4
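
                             A sketch combining the task-level properties (the values are illustrative):

                             import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.task.cpus\", \"2\")        // schedule 2 CPU cores per task\n  .set(\"spark.task.maxFailures\", \"8\") // tolerate more task failures before failing the job\n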

                            ","text":""},{"location":"configuration-properties/#sparkplugins","title":"spark.plugins

                            A comma-separated list of class names implementing org.apache.spark.api.plugin.SparkPlugin to load into a Spark application.

                            Default: (empty)

                            Since: 3.0.0

                            Set when SparkContext is created
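
                             A minimal sketch of a SparkPlugin wired in through spark.plugins; NoopPlugin is a hypothetical class, and returning null simply skips the driver- or executor-side component:

                             import org.apache.spark.SparkConf\nimport org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}\n\nclass NoopPlugin extends SparkPlugin {\n  // returning null means no component is installed on that side\n  override def driverPlugin(): DriverPlugin = null\n  override def executorPlugin(): ExecutorPlugin = null\n}\n\nval conf = new SparkConf().set(\"spark.plugins\", \"NoopPlugin\")\n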

                            ","text":""},{"location":"configuration-properties/#sparkpluginsdefaultlist","title":"spark.plugins.defaultList

                            FIXME

                            ","text":""},{"location":"configuration-properties/#sparkuishowconsoleprogress","title":"spark.ui.showConsoleProgress

                            Controls whether to enable ConsoleProgressBar and show the progress bar in the console

                            Default: false

                            ","text":""},{"location":"developer-api/","title":"Developer API","text":"

                            [TAGS]

                            "},{"location":"driver/","title":"Driver","text":"

                            A Spark driver (aka an application's driver process) is a JVM process that hosts SparkContext.md[SparkContext] for a Spark application. It is the master node in a Spark application.

                             It is the cockpit of job and task execution (using scheduler:DAGScheduler.md[DAGScheduler] and scheduler:TaskScheduler.md[Task Scheduler]). It hosts the spark-webui.md[Web UI] for the environment.

                            .Driver with the services image::spark-driver.png[align=\"center\"]

                            It splits a Spark application into tasks and schedules them to run on executors.

                            A driver is where the task scheduler lives and spawns tasks across workers.

                            A driver coordinates workers and overall execution of tasks.

                            NOTE: spark-shell.md[Spark shell] is a Spark application and the driver. It creates a SparkContext that is available as sc.

                             A driver requires the following additional services (besides the common ones like shuffle:ShuffleManager.md[], memory:MemoryManager.md[], storage:BlockTransferService.md[], and BroadcastManager):

                            • Listener Bus
                            • rpc:index.md[]
                            • scheduler:MapOutputTrackerMaster.md[] with the name MapOutputTracker
                            • storage:BlockManagerMaster.md[] with the name BlockManagerMaster
                            • MetricsSystem with the name driver
                            • OutputCommitCoordinator

                            CAUTION: FIXME Diagram of RpcEnv for a driver (and later executors). Perhaps it should be in the notes about RpcEnv?

                            • High-level control flow of work
                             • Your Spark application runs as long as the Spark driver does; once the driver terminates, so does your Spark application.
                            • Creates SparkContext, RDD's, and executes transformations and actions
                            • Launches scheduler:Task.md[tasks]

                            === [[driver-memory]] Driver's Memory

                            It can be set first using spark-submit/index.md#command-line-options[spark-submit's --driver-memory] command-line option or <> and falls back to spark-submit/index.md#environment-variables[SPARK_DRIVER_MEMORY] if not set earlier.

                            NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].

                            "},{"location":"driver/#driver-cores","title":"Driver Cores

                            It can be set first using spark-submit/index.md#driver-cores[spark-submit's --driver-cores] command-line option for cluster deploy mode.

                             NOTE: In client deploy mode the driver runs in the JVM process that launched the Spark application, so --driver-cores has no effect there.

                            NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].

                            === [[settings]] Settings

                            .Spark Properties [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Spark Property | Default Value | Description | [[spark_driver_blockManager_port]] spark.driver.blockManager.port | storage:BlockManager.md#spark_blockManager_port[spark.blockManager.port] | Port to use for the storage:BlockManager.md[BlockManager] on the driver.

                            More precisely, spark.driver.blockManager.port is used when core:SparkEnv.md#NettyBlockTransferService[NettyBlockTransferService is created] (while SparkEnv is created for the driver).

                            | [[spark_driver_memory]] spark.driver.memory | 1g | The driver's memory size (in MiBs).

                            Refer to <>.

                            | [[spark_driver_cores]] spark.driver.cores | 1 | The number of CPU cores assigned to the driver in cluster deploy mode.

                             NOTE: When yarn/spark-yarn-client.md#creating-instance[Client is created] (for Spark on YARN in cluster mode only), it sets the number of cores for the ApplicationMaster using spark.driver.cores.

                            Refer to <>.

                            | [[spark_driver_extraLibraryPath]] spark.driver.extraLibraryPath | |

                            | [[spark_driver_extraJavaOptions]] spark.driver.extraJavaOptions | | Additional JVM options for the driver.

                            | [[spark.driver.appUIAddress]] spark.driver.appUIAddress

                            spark.driver.appUIAddress is used exclusively in yarn/README.md[Spark on YARN]. It is set when yarn/spark-yarn-client-yarnclientschedulerbackend.md#start[YarnClientSchedulerBackend starts] to yarn/spark-yarn-applicationmaster.md#runExecutorLauncher[run ExecutorLauncher] (and yarn/spark-yarn-applicationmaster.md#registerAM[register ApplicationMaster] for the Spark application).

                            | [[spark_driver_libraryPath]] spark.driver.libraryPath | |

                            |===

                            ","text":""},{"location":"driver/#sparkdriverextraclasspath","title":"spark.driver.extraClassPath

                            spark.driver.extraClassPath system property sets the additional classpath entries (e.g. jars and directories) that should be added to the driver's classpath in cluster deploy mode.

                            ","text":""},{"location":"driver/#note","title":"[NOTE]","text":"

                            For client deploy mode you can use a properties file or command line to set spark.driver.extraClassPath.

                            Do not use SparkConf.md[SparkConf] since it is too late for client deploy mode given the JVM has already been set up to start a Spark application.

                            "},{"location":"driver/#refer-to-spark-classmdbuildsparksubmitcommandbuildsparksubmitcommand-internal-method-for-the-very-low-level-details-of-how-it-is-handled-internally","title":"Refer to spark-class.md#buildSparkSubmitCommand[buildSparkSubmitCommand Internal Method] for the very low-level details of how it is handled internally.","text":"

                             spark.driver.extraClassPath uses an OS-specific path separator.

                            NOTE: Use spark-submit's spark-submit/index.md#driver-class-path[--driver-class-path command-line option] on command line to override spark.driver.extraClassPath from a spark-properties.md#spark-defaults-conf[Spark properties file].

                            "},{"location":"local-properties/","title":"Local Properties","text":"

                            SparkContext.setLocalProperty lets you set key-value pairs that will be propagated down to tasks and can be accessed there using TaskContext.getLocalProperty.
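
                             For instance (a minimal sketch; the key name is arbitrary), a property set on the driver side can be read inside a task:

                             import org.apache.spark.TaskContext\n\nsc.setLocalProperty(\"myKey\", \"myValue\")\n\n// every task of this job sees the property through its TaskContext\nsc.parallelize(0 to 3).foreach { _ =>\n  println(TaskContext.get.getLocalProperty(\"myKey\"))\n}\n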

                            "},{"location":"local-properties/#creating-logical-job-groups","title":"Creating Logical Job Groups","text":"

                             One of the purposes of local properties is to create logical groups of Spark jobs by means of properties that (regardless of the threads used to submit the jobs) make the separate jobs launched from different threads belong to a single logical group.

                            A common use case for the local property concept is to set a local property in a thread, say spark-scheduler-FairSchedulableBuilder.md[spark.scheduler.pool], after which all jobs submitted within the thread will be grouped, say into a pool by FAIR job scheduler.

                            val data = sc.parallelize(0 to 9)\n\nsc.setLocalProperty(\"spark.scheduler.pool\", \"myPool\")\n\n// these two jobs (one per action) will run in the myPool pool\ndata.count\ndata.collect\n\nsc.setLocalProperty(\"spark.scheduler.pool\", null)\n\n// this job will run in the default pool\ndata.count\n
                            "},{"location":"master/","title":"Master","text":"

                            == Master

                            A master is a running Spark instance that connects to a cluster manager for resources.

                            The master acquires cluster nodes to run executors.

                            CAUTION: FIXME Add it to the Spark architecture figure above.

                            "},{"location":"overview/","title":"Spark Core","text":"

                             Apache Spark is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs in Scala, Python, Java, R, and SQL.

                            You could also describe Spark as a distributed, data processing engine for batch and streaming modes featuring SQL queries, graph processing, and machine learning.

                            In contrast to Hadoop\u2019s two-stage disk-based MapReduce computation engine, Spark's multi-stage (mostly) in-memory computing engine allows for running most computations in memory, and hence most of the time provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (read Spark officially sets a new record in large-scale sorting).

                            Spark aims at speed, ease of use, extensibility and interactive analytics.

                            Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms, and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Dataset.

                            Using Spark Application Frameworks, Spark simplifies access to machine learning and predictive analytics at scale.

                            Spark is mainly written in http://scala-lang.org/[Scala], but provides developer API for languages like Java, Python, and R.

                             If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is a viable alternative.

                            • Access any data type across any data source.
                            • Huge demand for storage and data processing.

                            The Apache Spark project is an umbrella for https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] (with Datasets), https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[streaming], http://spark.apache.org/mllib/[machine learning] (pipelines) and http://spark.apache.org/graphx/[graph] processing engines built on top of the Spark Core. You can run them all in a single application using a consistent API.

                            Spark runs locally as well as in clusters, on-premises or in cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).

                            Apache Spark's https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[Structured Streaming] and https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.

                             At a high level, any Spark application creates RDDs out of some input, runs rdd:index.md[(lazy) transformations] of these RDDs to some other form (shape), and finally performs rdd:index.md[actions] to collect or store data. Not much, huh?

                             You can look at Spark from a programmer's, a data engineer's, and an administrator's point of view. And to be honest, all three types of people will spend quite a lot of their time with Spark before they reach the point where they exploit all the available features. Programmers use language-specific APIs (and work at the level of RDDs using transformations and actions), data engineers use higher-level abstractions like DataFrames or Pipelines APIs or external tools (that connect to Spark), and none of it would run at all if administrators had not set up Spark clusters to deploy Spark applications to.

                            It is Spark's goal to be a general-purpose computing platform with various specialized applications frameworks on top of a single unified engine.

                            NOTE: When you hear \"Apache Spark\" it can be two things -- the Spark engine aka Spark Core or the Apache Spark open source project which is an \"umbrella\" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX] that sit on top of Spark Core and the main data abstraction in Spark called rdd:index.md[RDD - Resilient Distributed Dataset].

                            "},{"location":"overview/#why-spark","title":"Why Spark","text":"

                            Let's list a few of the many reasons for Spark. We are doing it first, and then comes the overview that lends a more technical helping hand.

                            "},{"location":"overview/#easy-to-get-started","title":"Easy to Get Started","text":"

                            Spark offers spark-shell that makes for a very easy head start to writing and running Spark applications on the command line on your laptop.

                            You could then use Spark Standalone built-in cluster manager to deploy your Spark applications to a production-grade cluster to run on a full dataset.

                            "},{"location":"overview/#unified-engine-for-diverse-workloads","title":"Unified Engine for Diverse Workloads","text":"

                             As said by Matei Zaharia - the author of Apache Spark - in the Introduction to AmpLab Spark Internals video (quoting with a few changes):

                             One of the Spark project goals was to deliver a platform that supports a very wide array of diverse workflows - not only MapReduce batch jobs (they were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning.

                            And also different scales of workloads from sub-second interactive jobs to jobs that run for many hours.

                            Spark combines batch, interactive, and streaming workloads under one rich concise API.

                            Spark supports near real-time streaming workloads via spark-streaming/spark-streaming.md[Spark Streaming] application framework.

                            ETL workloads and Analytics workloads are different, however Spark attempts to offer a unified platform for a wide variety of workloads.

                             Graph and Machine Learning algorithms are iterative by nature, and fewer saves to disk or transfers over the network mean better performance.

                            There is also support for interactive workloads using Spark shell.

                            You should watch the video https://youtu.be/SxAxAhn-BDU[What is Apache Spark?] by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides a very exceptional overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.

                            === Leverages the Best in distributed batch data processing

                            When you think about distributed batch data processing, varia/spark-hadoop.md[Hadoop] naturally comes to mind as a viable solution.

                            Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on YARN and HDFS - while improving on the performance and simplicity of the distributed computing engine.

                            For many, Spark is Hadoop++, i.e. MapReduce done in a better way.

                            And it should not come as a surprise, without Hadoop MapReduce (its advances and deficiencies), Spark would not have been born at all.

                            === RDD - Distributed Parallel Scala Collections

                            As a Scala developer, you may find Spark's RDD API very similar (if not identical) to http://www.scala-lang.org/docu/files/collections-api/collections.html[Scala's Collections API].

                            It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).

                            So, when you have a need for distributed Collections API in Scala, Spark with RDD API should be a serious contender.

                            === [[rich-standard-library]] Rich Standard Library

                            Not only can you use map and reduce (as in Hadoop MapReduce jobs) in Spark, but also a vast array of other higher-level operators to ease your Spark queries and application development.

                             It expanded the available computation styles beyond the map-and-reduce model, which is the only one available in Hadoop MapReduce.

                            === Unified development and deployment environment for all

                             Regardless of the Spark tools you use - the Spark API for the many programming languages supported - Scala, Java, Python, R, or spark-shell.md[the Spark shell], or the many Spark Application Frameworks leveraging the concept of rdd:index.md[RDD], i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX], you still use the same development and deployment environment to work on large data sets and yield a result, be it a prediction (spark-mllib/spark-mllib.md[Spark MLlib]), a structured data query (Spark SQL), or just a large distributed batch (Spark Core) or streaming (Spark Streaming) computation.

                             Spark is also very productive in that teams can exploit the different skills their members have acquired so far. Data analysts, data scientists, and Python, Java, Scala, or R programmers can all use the same Spark platform through tailor-made APIs. It brings skilled people with expertise in different programming languages together on a Spark project.

                            === Interactive Exploration / Exploratory Analytics

                            It is also called ad hoc queries.

                            Using spark-shell.md[the Spark shell] you can execute computations to process large amount of data (The Big Data). It's all interactive and very useful to explore the data before final production release.

Also, using the Spark shell you can access any spark-cluster.md[Spark cluster] as if it was your local machine. Just point the Spark shell to a 20-node cluster with 10TB of RAM in total (using --master) and use all the components (and their abstractions) like Spark SQL, Spark MLlib, spark-streaming/spark-streaming.md[Spark Streaming], and Spark GraphX.

Depending on your needs and skills, you may find SQL a better fit than the programming APIs, or apply machine learning algorithms (Spark MLlib) to data in graph data structures (Spark GraphX).

                            === Single Environment

                            Regardless of which programming language you are good at, be it Scala, Java, Python, R or SQL, you can use the same single clustered runtime environment for prototyping, ad hoc queries, and deploying your applications leveraging the many ingestion data points offered by the Spark platform.

                            You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark SQL (Datasets), Spark MLlib (ML Pipelines), Spark GraphX (Graphs) or spark-streaming/spark-streaming.md[Spark Streaming] (DStreams).

                            Or use them all in a single application.
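As a hedged sketch (assuming spark-shell, where a SparkContext sc and a SparkSession spark with its implicits are already available), you can move between the low-level RDD API and the higher-level Dataset API in the same application:

// low-level RDD API\nval ints = sc.parallelize(0 until 10)\n\n// higher-level Dataset API of Spark SQL (toDS comes from spark.implicits._,\n// which spark-shell imports automatically)\nval evens = ints.toDS.filter(_ % 2 == 0)\nevens.show\n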

                            The single programming model and execution engine for different kinds of workloads simplify development and deployment architectures.

                            === Data Integration Toolkit with Rich Set of Supported Data Sources

                            Spark can read from many types of data sources -- relational, NoSQL, file systems, etc. -- using many types of data formats - Parquet, Avro, CSV, JSON.

Both input and output data sources allow programmers and data engineers to use Spark as the platform where large amounts of data are read from or saved to for processing, interactively (using the Spark shell) or in applications.
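For example - a minimal sketch assuming a SparkSession available as spark and example file paths - reading one format and writing another takes a few lines with the DataFrameReader and DataFrameWriter APIs of Spark SQL:

// read a JSON data source (the path is only an example)\nval people = spark.read.json(\"/tmp/people.json\")\n\n// save it in another format, e.g. Parquet\npeople.write.parquet(\"/tmp/people.parquet\")\n\n// CSV with options\nval csv = spark.read.option(\"header\", \"true\").csv(\"/tmp/people.csv\")\n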

                            === Tools unavailable then, at your fingertips now

As much and as often as it's recommended http://c2.com/cgi/wiki?PickTheRightToolForTheJob[to pick the right tool for the job], it's not always feasible. Time, personal preference, and the operating system you work on are all factors in deciding what is right at the time (and using a hammer can be a reasonable choice).

                            Spark embraces many concepts in a single unified development and runtime environment.

• Machine learning that is so tool- and feature-rich in Python, e.g. the scikit-learn library, can now be used by Scala developers (as the Pipeline API in Spark MLlib or by calling pipe()).
                            • DataFrames from R are available in Scala, Java, Python, R APIs.
                            • Single node computations in machine learning algorithms are migrated to their distributed versions in Spark MLlib.

                            This single platform gives plenty of opportunities for Python, Scala, Java, and R programmers as well as data engineers (SparkR) and scientists (using proprietary enterprise data warehouses with spark-sql-thrift-server.md[Thrift JDBC/ODBC Server] in Spark SQL).

                            Mind the proverb https://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail[if all you have is a hammer, everything looks like a nail], too.

                            === Low-level Optimizations

Apache Spark uses a scheduler:DAGScheduler.md[directed acyclic graph (DAG) of computation stages] (aka execution DAG). It postpones any processing until it is really required by actions. Spark's lazy evaluation gives plenty of opportunities to apply low-level optimizations (so users have to know less to do more).
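A minimal sketch of lazy evaluation (assuming a SparkContext sc and an example input file): transformations only build the lineage, and nothing is computed until an action runs.

// transformations are lazy - nothing is computed yet\nval lines = sc.textFile(\"README.md\") // example input path\nval words = lines.flatMap(_.split(\" \"))\nval longWords = words.filter(_.length > 5)\n\n// only an action triggers execution of the (optimized) DAG\nval count = longWords.count\n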

                            Mind the proverb https://en.wiktionary.org/wiki/less_is_more[less is more].

                            === Excels at low-latency iterative workloads

                            Spark supports diverse workloads, but successfully targets low-latency iterative ones. They are often used in Machine Learning and graph algorithms.

Many Machine Learning algorithms, like logistic regression, require plenty of iterations before the resulting models become optimal. The same applies to graph algorithms that traverse all the nodes and edges when needed. Such computations can increase their performance when the interim partial results are stored in memory or on very fast solid state drives.

                            Spark can spark-rdd-caching.md[cache intermediate data in memory for faster model building and training]. Once the data is loaded to memory (as an initial step), reusing it multiple times incurs no performance slowdowns.
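A minimal sketch of such reuse (assuming a SparkContext sc and an example input file): cache the intermediate RDD once and run several actions, e.g. iterations of an algorithm, against the in-memory data.

val data = sc.textFile(\"data.txt\").map(_.length) // example dataset\ndata.cache() // keep the intermediate data in memory after the first action\n\nval n = data.count // the first action materializes and caches the RDD\nval max = data.max // subsequent actions reuse the in-memory data\nval total = data.sum\n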

                            Also, graph algorithms can traverse graphs one connection per iteration with the partial result in memory.

Less disk access and less network traffic can make a huge difference when you need to process lots of data, especially when it is Big Data.

                            === ETL done easier

                            Spark gives Extract, Transform and Load (ETL) a new look with the many programming languages supported - Scala, Java, Python (less likely R). You can use them all or pick the best for a problem.

Scala in Spark, especially, makes for much less boilerplate code (compared to other languages and approaches like MapReduce in Java).

                            === [[unified-api]] Unified Concise High-Level API

Spark offers unified, concise, high-level APIs for batch analytics (RDD API), SQL queries (Dataset API), real-time analysis (DStream API), machine learning (ML Pipeline API) and graph processing (Graph API).

Developers no longer have to learn many different processing engines and platforms, and can instead spend their time mastering the framework APIs per use case (atop a single computation engine, Spark).

                            === Different kinds of data processing using unified API

Spark offers three kinds of data processing - batch, interactive, and stream processing - with a unified API and data structures.

                            === Little to no disk use for better performance

In the not-so-long-ago times, when the most prevalent distributed computing framework was varia/spark-hadoop.md[Hadoop MapReduce], you could reuse data between computations (even partial ones!) only after you had written it to an external storage like varia/spark-hadoop.md[Hadoop Distributed Filesystem (HDFS)]. It can cost you a lot of time to run even very basic multi-stage computations. They simply suffer from IO (and perhaps network) overhead.

                            One of the many motivations to build Spark was to have a framework that is good at data reuse.

Spark addresses this by keeping as much data as possible in memory and keeping it there until a job is finished. It doesn't matter how many stages belong to a job. What does matter is the available memory and how effective you are in using the Spark API (so rdd:index.md[no shuffles occur]).

                            The less network and disk IO, the better performance, and Spark tries hard to find ways to minimize both.

                            === Fault Tolerance included

Faults are not considered a special case in Spark, but an obvious consequence of being a parallel and distributed system. Spark handles and recovers from faults by default without particularly complex logic to deal with them.

                            === Small Codebase Invites Contributors

Spark's design is fairly simple and the code that comes out of it is not huge compared to the features it offers.

The reasonably small codebase of Spark invites project contributors - programmers who extend the platform and fix bugs at a steady pace.

                            == [[i-want-more]] Further reading or watching

                            • (video) https://youtu.be/L029ZNBG7bk[Keynote: Spark 2.0 - Matei Zaharia, Apache Spark Creator and CTO of Databricks]
                            "},{"location":"push-based-shuffle/","title":"Push-Based Shuffle","text":"

                            Push-Based Shuffle is a new feature of Apache Spark 3.2.0 (cf. SPARK-30602) to improve shuffle efficiency.

                            Push-based shuffle is enabled using spark.shuffle.push.enabled configuration property and can only be used in a Spark application submitted to YARN cluster manager, with external shuffle service enabled, IO encryption disabled, and relocation of serialized objects supported.
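The following sketch (an assumption-laden example, not a complete deployment) shows how the property could be set alongside the other requirements, e.g. the external shuffle service, when building a SparkConf for a YARN application:

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.shuffle.push.enabled\", \"true\")    // enable push-based shuffle\n  .set(\"spark.shuffle.service.enabled\", \"true\") // external shuffle service is required\n  .set(\"spark.master\", \"yarn\")                  // only supported on YARN\n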

                            "},{"location":"spark-debugging/","title":"Debugging Spark","text":""},{"location":"spark-debugging/#using-spark-shell-and-intellij-idea","title":"Using spark-shell and IntelliJ IDEA","text":"

Start spark-shell with the SPARK_SUBMIT_OPTS environment variable that configures the JVM's JDWP agent.

                            SPARK_SUBMIT_OPTS=\"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005\" ./bin/spark-shell\n

                            Attach IntelliJ IDEA to the JVM process using Run > Attach to Local Process menu.

                            "},{"location":"spark-debugging/#using-sbt","title":"Using sbt","text":"

Use sbt -jvm-debug 5005, connect to the remote JVM at port 5005 using IntelliJ IDEA, and place breakpoints on the desired lines of Spark's source code.

                            $ sbt -jvm-debug 5005\nListening for transport dt_socket at address: 5005\n...\n

Create a Spark context and the breakpoints get triggered.

                            scala> val sc = new SparkContext(conf)\n15/11/14 22:58:46 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT\n

                            Tip

                            Read Debugging chapter in IntelliJ IDEA's Help.

                            "},{"location":"spark-logging/","title":"Spark Logging","text":"

                            Apache Spark uses Apache Log4j 2 for logging.

                            "},{"location":"spark-logging/#conflog4j2properties","title":"conf/log4j2.properties","text":"

The default logging configuration for Spark applications is in conf/log4j2.properties.

                            Use conf/log4j2.properties.template as a starting point.

                            "},{"location":"spark-logging/#logging-levels","title":"Logging Levels

                            The valid logging levels are log4j's Levels (from most specific to least):

• OFF - No events will be logged
• FATAL - A fatal event that will prevent the application from continuing
• ERROR - An error in the application, possibly recoverable
• WARN - An event that might possibly lead to an error
• INFO - An event for informational purposes
• DEBUG - A general debugging event
• TRACE - A fine-grained debug message, typically capturing the flow through the application
• ALL - All events should be logged

                            The names of the logging levels are case-insensitive.

                            ","text":""},{"location":"spark-logging/#turn-logging-off","title":"Turn Logging Off

                            The following sample conf/log4j2.properties turns all logging of Apache Spark (and Apache Hadoop) off.

                            # Set to debug or trace if log4j initialization fails\nstatus = warn\n\n# Name of the configuration\nname = exploring-internals\n\n# Console appender configuration\nappender.console.type = Console\nappender.console.name = consoleLogger\nappender.console.layout.type = PatternLayout\nappender.console.layout.pattern = %d{YYYY-MM-dd HH:mm:ss} [%t] %-5p %c:%L - %m%n\nappender.console.target = SYSTEM_OUT\n\nrootLogger.level = off\nrootLogger.appenderRef.stdout.ref = consoleLogger\n\nlogger.spark.name = org.apache.spark\nlogger.spark.level = off\n\nlogger.hadoop.name = org.apache.hadoop\nlogger.hadoop.level = off\n
                            ","text":""},{"location":"spark-logging/#setting-default-log-level-programatically","title":"Setting Default Log Level Programatically

Setting Default Log Level Programmatically

                            ","text":""},{"location":"spark-logging/#setting-log-levels-in-spark-applications","title":"Setting Log Levels in Spark Applications

In standalone Spark applications or while in a Spark Shell session, use the following:

import org.apache.hadoop.yarn.util.RackResolver\nimport org.apache.log4j.{Level, Logger}\n\nLogger.getLogger(classOf[RackResolver]).getLevel\n// turn all org.* loggers (Spark, Hadoop, ...) off\nLogger.getLogger(\"org\").setLevel(Level.OFF)\n
                            ","text":""},{"location":"spark-properties/","title":"Spark Properties and spark-defaults.conf Properties File","text":"

                            Spark properties are the means of tuning the execution environment of a Spark application.

The default Spark properties file is <<spark-defaults-conf, $SPARK_HOME/conf/spark-defaults.conf>> that can be overridden using spark-submit with the spark-submit/index.md#properties-file[--properties-file] command-line option.

Environment Variables:

• SPARK_CONF_DIR (default: ${SPARK_HOME}/conf) - Spark's configuration directory (with spark-defaults.conf)

                            TIP: Read the official documentation of Apache Spark on http://spark.apache.org/docs/latest/configuration.html[Spark Configuration].

                            === [[spark-defaults-conf]] spark-defaults.conf -- Default Spark Properties File

                            spark-defaults.conf (under SPARK_CONF_DIR or $SPARK_HOME/conf) is the default properties file with the Spark properties of your Spark applications.

                            NOTE: spark-defaults.conf is loaded by spark-AbstractCommandBuilder.md#loadPropertiesFile[AbstractCommandBuilder's loadPropertiesFile internal method].

                            === [[getDefaultPropertiesFile]] Calculating Path of Default Spark Properties -- Utils.getDefaultPropertiesFile method

                            "},{"location":"spark-properties/#source-scala","title":"[source, scala]","text":""},{"location":"spark-properties/#getdefaultpropertiesfileenv-mapstring-string-sysenv-string","title":"getDefaultPropertiesFile(env: Map[String, String] = sys.env): String","text":"

getDefaultPropertiesFile calculates the absolute path to the spark-defaults.conf properties file that can be either in the directory specified by the SPARK_CONF_DIR environment variable or in the $SPARK_HOME/conf directory.

                            NOTE: getDefaultPropertiesFile is part of private[spark] org.apache.spark.util.Utils object.

                            "},{"location":"spark-tips-and-tricks-access-private-members-spark-shell/","title":"Access private members in Scala in Spark shell","text":"

                            == Access private members in Scala in Spark shell

                            If you ever wanted to use private[spark] members in Spark using the Scala programming language, e.g. toy with org.apache.spark.scheduler.DAGScheduler or similar, you will have to use the following trick in Spark shell - use :paste -raw as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].

                            Open spark-shell and execute :paste -raw that allows you to enter any valid Scala code, including package.

                            The following snippet shows how to access private[spark] member DAGScheduler.RESUBMIT_TIMEOUT:

                            scala> :paste -raw\n// Entering paste mode (ctrl-D to finish)\n\npackage org.apache.spark\n\nobject spark {\n  def test = {\n    import org.apache.spark.scheduler._\n    println(DAGScheduler.RESUBMIT_TIMEOUT == 200)\n  }\n}\n\nscala> spark.test\ntrue\n\nscala> sc.version\nres0: String = 1.6.0-SNAPSHOT\n
                            "},{"location":"spark-tips-and-tricks-running-spark-windows/","title":"Running Spark Applications on Windows","text":"

                            == Running Spark Applications on Windows

Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.

                            NOTE: A Spark application could be spark-shell.md[spark-shell] or your own custom Spark application.

What makes the huge difference between the operating systems is Hadoop, which is used internally for file system access in Spark.

You may run into a few minor issues when you are on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

                            NOTE: You do not have to install Apache Hadoop to work with Spark or run Spark applications.

                            TIP: Read the Apache Hadoop project's https://wiki.apache.org/hadoop/WindowsProblems[Problems running Hadoop on Windows].

Among the issues is the infamous java.io.IOException when running Spark Shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).

                            16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path\njava.io.IOException: Could not locate executable null\\bin\\winutils.exe in the Hadoop binaries.\n  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)\n  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)\n  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)\n  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)\n  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)\n  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)\n  at java.lang.Class.forName0(Native Method)\n  at java.lang.Class.forName(Class.java:348)\n  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)\n  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)\n  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)\n
                            "},{"location":"spark-tips-and-tricks-running-spark-windows/#note","title":"[NOTE]","text":"

You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window (cmd) run as Administrator, i.e. using the Run as administrator option while starting cmd.

                            "},{"location":"spark-tips-and-tricks-running-spark-windows/#read-the-official-document-in-microsoft-technet-httpstechnetmicrosoftcomen-uslibrarycc947813vws10aspxstart-a-command-prompt-as-an-administrator","title":"Read the official document in Microsoft TechNet -- ++https://technet.microsoft.com/en-us/library/cc947813(v=ws.10).aspx++[Start a Command Prompt as an Administrator].","text":"

                            Download winutils.exe binary from https://github.com/steveloughran/winutils repository.

                            NOTE: You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2 (https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe[here is the direct link to winutils.exe binary]).

                            Save winutils.exe binary to a directory of your choice, e.g. c:\\hadoop\\bin.

                            Set HADOOP_HOME to reflect the directory with winutils.exe (without bin).

                            set HADOOP_HOME=c:\\hadoop\n

                            Set PATH environment variable to include %HADOOP_HOME%\\bin as follows:

                            set PATH=%HADOOP_HOME%\\bin;%PATH%\n

                            TIP: Define HADOOP_HOME and PATH environment variables in Control Panel so any Windows program would use them.

                            Create C:\\tmp\\hive directory.

                            "},{"location":"spark-tips-and-tricks-running-spark-windows/#note_1","title":"[NOTE]","text":"

                            c:\\tmp\\hive directory is the default value of https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir[hive.exec.scratchdir configuration property] in Hive 0.14.0 and later and Spark uses a custom build of Hive 1.2.1.

                            "},{"location":"spark-tips-and-tricks-running-spark-windows/#you-can-change-hiveexecscratchdir-configuration-property-to-another-directory-as-described-in-wzxhzdk27-configuration-property-in-this-document","title":"You can change hive.exec.scratchdir configuration property to another directory as described in <hive.exec.scratchdir Configuration Property>> in this document.

                            Execute the following command in cmd that you started using the option Run as administrator.

                            winutils.exe chmod -R 777 C:\\tmp\\hive\n

                            Check the permissions (that is one of the commands that are executed under the covers):

                            winutils.exe ls -F C:\\tmp\\hive\n

                            Open spark-shell and observe the output (perhaps with few WARN messages that you can simply disregard).

                            As a verification step, execute the following line to display the content of a DataFrame:

                            ","text":""},{"location":"spark-tips-and-tricks-running-spark-windows/#source-scala","title":"[source, scala]","text":"

scala> spark.range(1).withColumn(\"status\", lit(\"All seems fine. Congratulations!\")).show(false)\n+---+--------------------------------+\n|id |status                          |\n+---+--------------------------------+\n|0  |All seems fine. Congratulations!|\n+---+--------------------------------+\n

                            "},{"location":"spark-tips-and-tricks-running-spark-windows/#note_2","title":"[NOTE]

                            Disregard WARN messages when you start spark-shell. They are harmless.

                            ","text":""},{"location":"spark-tips-and-tricks-running-spark-windows/#161226-220541-warn-general-plugin-bundle-orgdatanucleus-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27jarsdatanucleus-core-3210jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27binjarsdatanucleus-core-3210jar-161226-220541-warn-general-plugin-bundle-orgdatanucleusapijdo-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27jarsdatanucleus-api-jdo-326jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27binjarsdatanucleus-api-jdo-326jar-161226-220541-warn-general-plugin-bundle-orgdatanucleusstorerdbms-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27binjarsdatanucleus-rdbms-329jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27jarsdatanucleus-rdbms-329jar","title":"
                            16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus\" is already registered. Ensure you dont have multiple JAR versions of\nthe same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar\" is already registered,\nand you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-\n3.2.10.jar.\"\n16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus.api.jdo\" is already registered. Ensure you dont have multiple JAR\nversions of the same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar\" is already\nregistered, and you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-\nhadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar.\"\n16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus.store.rdbms\" is already registered. Ensure you dont have multiple JAR\nversions of the same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar\" is\nalready registered, and you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-\nhadoop2.7/jars/datanucleus-rdbms-3.2.9.jar.\"\n

                            If you see the above output, you're done. You should now be able to run Spark applications on your Windows. Congrats!

                            === [[changing-hive.exec.scratchdir]] Changing hive.exec.scratchdir Configuration Property

                            Create a hive-site.xml file with the following content:

                            <configuration>\n  <property>\n    <name>hive.exec.scratchdir</name>\n    <value>/tmp/mydir</value>\n    <description>Scratch space for Hive jobs</description>\n  </property>\n</configuration>\n

                            Start a Spark application, e.g. spark-shell, with HADOOP_CONF_DIR environment variable set to the directory with hive-site.xml.

                            HADOOP_CONF_DIR=conf ./bin/spark-shell\n
                            ","text":""},{"location":"spark-tips-and-tricks-sparkexception-task-not-serializable/","title":"Task not serializable Exception","text":"

                            == org.apache.spark.SparkException: Task not serializable

                            When you run into org.apache.spark.SparkException: Task not serializable exception, it means that you use a reference to an instance of a non-serializable class inside a transformation. See the following example:

                            \u279c  spark git:(master) \u2717 ./bin/spark-shell\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 1.6.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala> class NotSerializable(val num: Int)\ndefined class NotSerializable\n\nscala> val notSerializable = new NotSerializable(10)\nnotSerializable: NotSerializable = NotSerializable@2700f556\n\nscala> sc.parallelize(0 to 10).map(_ => notSerializable.num).count\norg.apache.spark.SparkException: Task not serializable\n  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)\n  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)\n  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)\n  at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)\n  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)\n  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)\n  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)\n  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)\n  at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)\n  at org.apache.spark.rdd.RDD.map(RDD.scala:317)\n  ... 48 elided\nCaused by: java.io.NotSerializableException: NotSerializable\nSerialization stack:\n    - object not serializable (class: NotSerializable, value: NotSerializable@2700f556)\n    - field (class: $iw, name: notSerializable, type: class NotSerializable)\n    - object (class $iw, $iw@10e542f3)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@729feae8)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5fc3b20b)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@36dab184)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5eb974)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@79c514e4)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5aeaee3)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@2be9425f)\n    - field (class: $line18.$read, name: $iw, type: class $iw)\n    - object (class $line18.$read, $line18.$read@6311640d)\n    - field (class: $iw, name: $line18$read, type: class $line18.$read)\n    - object (class $iw, $iw@c9cd06e)\n    - field (class: $iw, name: $outer, type: class $iw)\n    - object (class $iw, $iw@6565691a)\n    - field (class: $anonfun$1, name: $outer, type: class $iw)\n    - object (class $anonfun$1, <function1>)\n  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)\n  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)\n  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)\n  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)\n  ... 57 more\n

                            === Further reading

                            • https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
                            • https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
                            • http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
                            "},{"location":"spark-tips-and-tricks/","title":"Spark Tips and Tricks","text":"

                            = Spark Tips and Tricks

                            == [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts

                            SPARK_PRINT_LAUNCH_COMMAND environment variable controls whether the Spark launch command is printed out to the standard error output, i.e. System.err, or not.

                            Spark Command: [here comes the command]\n========================================\n

All the Spark shell scripts use the org.apache.spark.launcher.Main class internally, which checks SPARK_PRINT_LAUNCH_COMMAND and, when it is set (to any value), prints out the entire command line used to launch it.

                            $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell\nSpark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell\n========================================\n

                            == Show Spark version in Spark shell

                            In spark-shell, use sc.version or org.apache.spark.SPARK_VERSION to know the Spark version:

                            scala> sc.version\nres0: String = 1.6.0-SNAPSHOT\n\nscala> org.apache.spark.SPARK_VERSION\nres1: String = 1.6.0-SNAPSHOT\n

                            == Resolving local host name

When you face networking issues, e.g. when Spark can't resolve your local hostname or IP address, use the preferred SPARK_LOCAL_HOSTNAME environment variable as the custom host name, or SPARK_LOCAL_IP as the custom IP that is later resolved to a hostname.

                            Spark checks them out before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).

You may see the following WARN messages in the logs when Spark has finished the resolving process:

                            Your hostname, [hostname] resolves to a loopback address: [host-address]; using...\nSet SPARK_LOCAL_IP if you need to bind to another address\n
                            "},{"location":"spark-tips-and-tricks/#starting-standalone-master-and-workers-on-windows-7","title":"Starting standalone Master and workers on Windows 7","text":"

                            Windows 7 users can use spark-class to start Spark Standalone as there are no launch scripts for the Windows platform.

                            ./bin/spark-class org.apache.spark.deploy.master.Master -h localhost\n
                            ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077\n
                            "},{"location":"speculative-execution-of-tasks/","title":"Speculative Execution of Tasks","text":"

Speculative tasks (also speculatable tasks or task strugglers) are tasks that run slower than most (FIXME the setting) of all the tasks in a job.

Speculative execution of tasks is a health-check procedure that checks for tasks to be speculated, i.e. running slower in a stage than the median of all successfully completed tasks in a taskset (FIXME the setting). Such slow tasks will be re-submitted to another worker. Spark will not stop the slow tasks, but runs a new copy in parallel.

                            The thread starts as TaskSchedulerImpl starts in spark-cluster.md[clustered deployment modes] with configuration-properties.md#spark.speculation[spark.speculation] enabled. It executes periodically every configuration-properties.md#spark.speculation.interval[spark.speculation.interval] after the initial spark.speculation.interval passes.
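A minimal configuration sketch (illustrative values only) that turns the health-check thread on:

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.speculation\", \"true\")           // start the speculation thread\n  .set(\"spark.speculation.interval\", \"100ms\") // how often to check for speculatable tasks\n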

                            When enabled, you should see the following INFO message in the logs:

                            "},{"location":"speculative-execution-of-tasks/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"speculative-execution-of-tasks/#starting-speculative-execution-thread","title":"Starting speculative execution thread","text":"

                            It works as scheduler:TaskSchedulerImpl.md#task-scheduler-speculation[task-scheduler-speculation daemon thread pool] (using j.u.c.ScheduledThreadPoolExecutor with core pool size of 1).

                            The job with speculatable tasks should finish while speculative tasks are running, and it will leave these tasks running - no KILL command yet.

                            It uses checkSpeculatableTasks method that asks rootPool to check for speculatable tasks. If there are any, SchedulerBackend is called for scheduler:SchedulerBackend.md#reviveOffers[reviveOffers].

                            CAUTION: FIXME How does Spark handle repeated results of speculative tasks since there are copies launched?

                            "},{"location":"workers/","title":"Workers","text":"

                            == Workers

                            Workers (aka slaves) are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark.

                            CAUTION: FIXME Are workers perhaps part of Spark Standalone only?

                            CAUTION: FIXME How many executors are spawned per worker?

                            A worker receives serialized tasks that it runs in a thread pool.

                            It hosts a local storage:BlockManager.md[Block Manager] that serves blocks to other workers in a Spark cluster. Workers communicate among themselves using their Block Manager instances.

                            CAUTION: FIXME Diagram of a driver with workers as boxes.

                            Explain task execution in Spark and understand Spark\u2019s underlying execution model.

                            New vocabulary often faced in Spark UI

                            SparkContext.md[When you create SparkContext], each worker starts an executor. This is a separate process (JVM), and it loads your jar, too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey. When the driver quits, the executors shut down.

                            A new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.

                            The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.

                            Shortly speaking, an application in Spark is executed in three steps:

                            1. Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire computation.
                            2. Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
                            3. Based on the plan, schedule and execute tasks on workers.

                            exercises/spark-examples-wordcount-spark-shell.md[In the WordCount example], the RDD graph is as follows:

                            file -> lines -> words -> per-word count -> global word count -> output
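As a rough sketch of that lineage (assuming a SparkContext sc and an example input file; the exact shape of the original WordCount example may differ):

val lines = sc.textFile(\"input.txt\") // file -> lines\nval words = lines.flatMap(_.split(\" \")) // words\nval perWordCount = words.map((_, 1)).reduceByKey(_ + _) // per-word count (shuffle boundary)\nval globalWordCount = perWordCount.map(_._2).sum // global word count\nprintln(globalWordCount) // output\n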

                            Based on this graph, two stages are created. The stage creation rule is based on the idea of pipelining as many rdd:index.md[narrow transformations] as possible. RDD operations with \"narrow\" dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage.

                            In the end, every stage will only have shuffle dependencies on other stages, and may compute multiple operations inside it.

                            In the WordCount example, the narrow transformation finishes at per-word count. Therefore, you get two stages:

                            • file -> lines -> words -> per-word count
                            • global word count -> output

                            Once stages are defined, Spark will generate scheduler:Task.md[tasks] from scheduler:Stage.md[stages]. The first stage will create scheduler:ShuffleMapTask.md[ShuffleMapTask]s with the last stage creating scheduler:ResultTask.md[ResultTask]s because in the last stage, one action operation is included to produce results.

The number of tasks to be generated depends on how your files are distributed. Suppose that you have three different files in three different nodes; the first stage will generate three tasks: one task per partition.

                            Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is related to a partition.

                            The number of tasks being generated in each stage will be equal to the number of partitions.

                            === [[Cleanup]] Cleanup

                            CAUTION: FIXME

                            === [[settings]] Settings

• spark.worker.cleanup.enabled (default: false) controls whether <<Cleanup, Cleanup>> is enabled."},{"location":"accumulators/","title":"Accumulators","text":"

Accumulators are shared variables that accumulate values from executors on the driver using an associative and commutative \"add\" operation.

                              The main abstraction is AccumulatorV2.

                              Accumulators are registered (created) using SparkContext with or without a name. Only named accumulators are displayed in web UI.

                              DAGScheduler is responsible for updating accumulators (from partial values from tasks running on executors every heartbeat).

Accumulators are serializable so they can safely be referenced in the code executed in executors and then safely sent over the wire for execution.

                              // on the driver\nval counter = sc.longAccumulator(\"counter\")\n\nsc.parallelize(1 to 9).foreach { x =>\n  // on executors\n  counter.add(x) }\n\n// on the driver\nprintln(counter.value)\n
                              "},{"location":"accumulators/#further-reading","title":"Further Reading","text":"
                              • Performance and Scalability of Broadcast in Spark
                              "},{"location":"accumulators/AccumulableInfo/","title":"AccumulableInfo","text":"

                              AccumulableInfo represents an update to an AccumulatorV2.

                              AccumulableInfo is used to transfer accumulator updates from executors to the driver every executor heartbeat or when a task finishes.

                              "},{"location":"accumulators/AccumulableInfo/#creating-instance","title":"Creating Instance","text":"

                              AccumulableInfo takes the following to be created:

                              • Accumulator ID
                              • Name
                              • Partial Update
                              • Partial Value
                              • internal flag
                              • countFailedValues flag
                              • Metadata (default: None)

                                AccumulableInfo is created\u00a0when:

                                • AccumulatorV2 is requested to convert itself to an AccumulableInfo
                                • JsonProtocol is requested to accumulableInfoFromJson
                                • SQLMetric (Spark SQL) is requested to convert itself to an AccumulableInfo
                                "},{"location":"accumulators/AccumulableInfo/#internal-flag","title":"internal Flag
                                internal: Boolean\n

                                AccumulableInfo is given an internal flag when created.

                                internal flag denotes whether this accumulator is internal.

                                internal is used when:

                                • LiveEntityHelpers is requested for newAccumulatorInfos
                                • JsonProtocol is requested to accumulableInfoToJson
                                ","text":""},{"location":"accumulators/AccumulatorContext/","title":"AccumulatorContext","text":"

                                == [[AccumulatorContext]] AccumulatorContext

                                AccumulatorContext is a private[spark] internal object used to track accumulators by Spark itself using an internal originals lookup table. Spark uses the AccumulatorContext object to register and unregister accumulators.

                                The originals lookup table maps accumulator identifier to the accumulator itself.

                                Every accumulator has its own unique accumulator id that is assigned using the internal nextId counter.

                                === [[register]] register Method

                                CAUTION: FIXME

                                === [[newId]] newId Method

                                CAUTION: FIXME

                                === [[AccumulatorContext-SQL_ACCUM_IDENTIFIER]] AccumulatorContext.SQL_ACCUM_IDENTIFIER

                                AccumulatorContext.SQL_ACCUM_IDENTIFIER is an internal identifier for Spark SQL's internal accumulators. The value is sql and Spark uses it to distinguish spark-sql-SparkPlan.md#SQLMetric[Spark SQL metrics] from others.

                                "},{"location":"accumulators/AccumulatorSource/","title":"AccumulatorSource","text":"

                                AccumulatorSource is...FIXME

                                "},{"location":"accumulators/AccumulatorV2/","title":"AccumulatorV2","text":"

AccumulatorV2[IN, OUT] is an abstraction of accumulators.

                                AccumulatorV2 is a Java Serializable.

                                "},{"location":"accumulators/AccumulatorV2/#contract","title":"Contract","text":""},{"location":"accumulators/AccumulatorV2/#adding-value","title":"Adding Value
                                add(\n  v: IN): Unit\n

                                Accumulates (adds) the given v value to this accumulator

                                ","text":""},{"location":"accumulators/AccumulatorV2/#copying-accumulator","title":"Copying Accumulator
                                copy(): AccumulatorV2[IN, OUT]\n
                                ","text":""},{"location":"accumulators/AccumulatorV2/#is-zero-value","title":"Is Zero Value
                                isZero: Boolean\n
                                ","text":""},{"location":"accumulators/AccumulatorV2/#merging-updates","title":"Merging Updates
                                merge(\n  other: AccumulatorV2[IN, OUT]): Unit\n
                                ","text":""},{"location":"accumulators/AccumulatorV2/#resetting-accumulator","title":"Resetting Accumulator
                                reset(): Unit\n
                                ","text":""},{"location":"accumulators/AccumulatorV2/#value","title":"Value
                                value: OUT\n

                                The current value of this accumulator

                                Used when:

                                • TaskRunner is requested to collectAccumulatorsAndResetStatusOnFailure
                                • AccumulatorSource is requested to register
                                • DAGScheduler is requested to update accumulators
                                • TaskSchedulerImpl is requested to executorHeartbeatReceived
                                • TaskSetManager is requested to handleSuccessfulTask
                                • JsonProtocol is requested to taskEndReasonFromJson
                                • others
                                ","text":""},{"location":"accumulators/AccumulatorV2/#implementations","title":"Implementations","text":"
                                • AggregatingAccumulator (Spark SQL)
                                • CollectionAccumulator
                                • DoubleAccumulator
                                • EventTimeStatsAccum (Spark Structured Streaming)
                                • LongAccumulator
                                • SetAccumulator (Spark SQL)
                                • SQLMetric (Spark SQL)
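As a minimal sketch of the contract above (a hypothetical accumulator, not one of the built-in implementations), the following collects unique words from executors; it assumes a SparkContext sc for the usage part:

import org.apache.spark.util.AccumulatorV2\n\n// a hypothetical AccumulatorV2[String, Set[String]] that collects unique words\nclass UniqueWordsAccumulator extends AccumulatorV2[String, Set[String]] {\n  private var words = Set.empty[String]\n  override def isZero: Boolean = words.isEmpty\n  override def copy(): AccumulatorV2[String, Set[String]] = {\n    val acc = new UniqueWordsAccumulator\n    acc.words = words\n    acc\n  }\n  override def reset(): Unit = { words = Set.empty[String] }\n  override def add(v: String): Unit = { words += v }\n  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { words ++= other.value }\n  override def value: Set[String] = words\n}\n\n// register with SparkContext (only named accumulators show up in web UI)\nval acc = new UniqueWordsAccumulator\nsc.register(acc, \"uniqueWords\")\nsc.parallelize(Seq(\"hello\", \"world\", \"hello\")).foreach(acc.add)\nprintln(acc.value) // Set(hello, world)\n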
                                "},{"location":"accumulators/AccumulatorV2/#converting-this-accumulator-to-accumulableinfo","title":"Converting this Accumulator to AccumulableInfo
                                toInfo(\n  update: Option[Any],\n  value: Option[Any]): AccumulableInfo\n

                                toInfo determines whether the accumulator is internal based on the name (and whether it uses the internal.metrics prefix) and uses it to create an AccumulableInfo.

                                toInfo\u00a0is used when:

                                • TaskRunner is requested to collectAccumulatorsAndResetStatusOnFailure
                                • DAGScheduler is requested to updateAccumulators
                                • TaskSchedulerImpl is requested to executorHeartbeatReceived
                                • JsonProtocol is requested to taskEndReasonFromJson
                                • SQLAppStatusListener (Spark SQL) is requested to handle a SparkListenerTaskEnd event (onTaskEnd)
                                ","text":""},{"location":"accumulators/AccumulatorV2/#registering-accumulator","title":"Registering Accumulator
                                register(\n  sc: SparkContext,\n  name: Option[String] = None,\n  countFailedValues: Boolean = false): Unit\n

                                register...FIXME

                                register\u00a0is used when:

                                • SparkContext is requested to register an accumulator
                                • TaskMetrics is requested to register task accumulators
                                • CollectMetricsExec (Spark SQL) is requested for an AggregatingAccumulator
                                • SQLMetrics (Spark SQL) is used to create a performance metric
                                ","text":""},{"location":"accumulators/AccumulatorV2/#serializing-accumulatorv2","title":"Serializing AccumulatorV2
                                writeReplace(): Any\n

                                writeReplace is part of the Serializable (Java) abstraction (to designate an alternative object to be used when writing an object to the stream).

                                writeReplace...FIXME

                                ","text":""},{"location":"accumulators/AccumulatorV2/#deserializing-accumulatorv2","title":"Deserializing AccumulatorV2
                                readObject(\n  in: ObjectInputStream): Unit\n

                                readObject is part of the Serializable (Java) abstraction (for special handling during deserialization).

                                readObject reads the non-static and non-transient fields of the AccumulatorV2 from the given ObjectInputStream.

                                If the atDriverSide internal flag is turned on, readObject turns it off (to indicate readObject is executed on an executor). Otherwise, atDriverSide internal flag is turned on.

                                readObject requests the active TaskContext to register this accumulator.

                                ","text":""},{"location":"accumulators/InternalAccumulator/","title":"InternalAccumulator","text":"

InternalAccumulator is a utility with field names for internal accumulators.

                                "},{"location":"accumulators/InternalAccumulator/#internalmetrics-prefix","title":"internal.metrics Prefix

                                internal.metrics. is the prefix of metrics that are considered internal and should not be displayed in web UI.

                                internal.metrics. is used when:

                                • AccumulatorV2 is requested to convert itself to AccumulableInfo and writeReplace
                                • JsonProtocol is requested to accumValueToJson and accumValueFromJson
                                ","text":""},{"location":"barrier-execution-mode/","title":"Barrier Execution Mode","text":"

                                Barrier Execution Mode (Barrier Scheduling) introduces a strong requirement on Spark Scheduler to launch all tasks of a Barrier Stage at the same time or not at all (and consequently wait until required resources are available). Moreover, a failure of a single task of a barrier stage fails the whole stage (and so the other tasks).

                                Barrier Execution Mode allows for as many tasks to be executed concurrently as ResourceProfile permits (that is enforced upon scheduling a barrier job).

                                Barrier Execution Mode aims at making Distributed Deep Learning with Apache Spark easier (or even possible).

                                Rephrasing dmlc/xgboost, Barrier Execution Mode makes sure that:

1. All tasks of a barrier stage are launched at once. If there are not enough task slots, an exception is produced

                                2. Tasks either all succeed or fail. Upon a task failure Spark aborts all the other tasks (TaskScheduler will kill all other running tasks) and restarts the whole barrier stage

                                3. Spark makes no assumption that tasks don't talk to each other. Actually, it is the opposite. Spark provides BarrierTaskContext which facilitates tasks discovery (e.g., barrier, allGather)

                                4. Permits restarting a training from a known state (checkpoint) in case of a failure

                                From the Design doc: Barrier Execution Mode:

                                In Spark, a task in a stage doesn't depend on any other task in the same stage, and hence it can be scheduled independently.

That gives Spark the freedom to schedule tasks in as many task batches as needed. So, 5 tasks can be scheduled on 1 CPU core quite easily in 5 consecutive batches. That's unlike MPI (or other non-MapReduce scheduling systems) that allow for greater flexibility and inter-task dependencies.

                                Later in Design doc: Barrier Execution Mode:

                                In MPI, all workers start at the same time and pass messages around.

                                To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named \"barrier scheduling\", which launches the tasks at the same time and provides users enough information and tooling to embed distributed DL training into a Spark pipeline.

                                "},{"location":"barrier-execution-mode/#barrier-rdd","title":"Barrier RDD","text":"

                                Barrier RDD is a RDDBarrier.

                                "},{"location":"barrier-execution-mode/#barrier-stage","title":"Barrier Stage","text":"

                                Barrier Stage is a Stage with at least one Barrier RDD.

                                "},{"location":"barrier-execution-mode/#abstractions","title":"Abstractions","text":"
                                • BarrierTaskContext
                                • RDDBarrier
                                "},{"location":"barrier-execution-mode/#barrier","title":"RDD.barrier Operator","text":"

                                Barrier Execution Mode is based on RDD.barrier operator to indicate that Spark Scheduler must launch the tasks together for the current stage (and mark the current stage as a barrier stage).

                                barrier(): RDDBarrier[T]\n

                                RDD.barrier creates a RDDBarrier that comes with the barrier-aware mapPartitions transformation.

                                mapPartitions[S](\n  f: Iterator[T] => Iterator[S],\n  preservesPartitioning: Boolean = false): RDD[S]\n

                                Under the covers, RDDBarrier.mapPartitions creates a MapPartitionsRDD like the regular RDD.mapPartitions transformation but with isFromBarrier flag enabled.

• Task has an isBarrier flag that says whether this task belongs to a barrier stage (default: false).
                                "},{"location":"barrier-execution-mode/#isFromBarrier","title":"isFromBarrier Flag","text":"

An RDD is in a barrier stage if at least one of its parent RDDs, or the RDD itself, is mapped from an RDDBarrier.

                                ShuffledRDD has the isBarrier flag always disabled (false).

                                MapPartitionsRDD is the only RDD that can have the isBarrier flag enabled.

                                RDDBarrier.mapPartitions is the only transformation that creates a MapPartitionsRDD with the isFromBarrier flag enabled.

                                "},{"location":"barrier-execution-mode/#unsupported-spark-features","title":"Unsupported Spark Features","text":"

                                The following Spark features are not supported:

                                • Push-Based Shuffle
                                • Dynamic Allocation of Executors
                                "},{"location":"barrier-execution-mode/#demo","title":"Demo","text":"

                                Enable ALL logging level for org.apache.spark.BarrierTaskContext logger to see what happens inside.

                                val tasksNum = 3\nval nums = sc.parallelize(seq = 0 until 9, numSlices = tasksNum)\nassert(nums.getNumPartitions == tasksNum)\n

                                Print out the available partitions and the number of records within each (using Spark SQL for a human-friendlier output).

                                Scala
                                import org.apache.spark.TaskContext\nnums\n  .mapPartitions { it => Iterator.single((TaskContext.get.partitionId, it.size)) }\n  .toDF(\"partitionId\", \"size\")\n  .show\n
                                +-----------+----+\n|partitionId|size|\n+-----------+----+\n|          0|   3|\n|          1|   3|\n|          2|   3|\n+-----------+----+\n
                                "},{"location":"barrier-execution-mode/#distributed-training","title":"Distributed Training","text":"

RDD.barrier creates an RDDBarrier (and marks the current stage as a Barrier Stage).

                                import org.apache.spark.rdd.RDDBarrier\nassert(nums.barrier.isInstanceOf[RDDBarrier[_]])\n

Use the RDDBarrier.mapPartitions transformation to access a BarrierTaskContext.

                                val barrierRdd = nums\n  .barrier\n  .mapPartitions { ns =>\n    import org.apache.spark.{BarrierTaskContext, TaskContext}\n    val ctx = TaskContext.get.asInstanceOf[BarrierTaskContext]\n    val tid = ctx.partitionId()\n    val port = 10000 + tid\n    val host = \"localhost\"\n    val message = s\"A message from task $tid, e.g. $host:$port it listens at\"\n    val allTaskMessages = ctx.allGather(message)\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> Got host:port's from the other tasks\")\n      allTaskMessages.foreach(println)\n    }\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> Starting a distributed training at the nodes...\")\n    }\n\n    ctx.barrier() // this is BarrierTaskContext.barrier (not RDD.barrier)\n                  // which can be confusing\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> All tasks have finished\")\n    }\n\n    // return a model after combining (model) pieces from the nodes\n    ns\n  }\n

                                Run a distributed computation (using RDD.count action).

                                barrierRdd.count()\n

                                There should be INFO and TRACE messages printed out to the console (given ALL logging level for org.apache.spark.BarrierTaskContext logger).

                                [Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO  org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) has entered the global sync, current barrier epoch is 0.\n...\n[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] TRACE org.apache.spark.BarrierTaskContext:68 - Current callSite: CallSite($anonfun$runBarrier$2 at Logging.scala:68,org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$2(BarrierTaskContext.scala:61)\n...\n[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO  org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) finished global sync successfully, waited for 1 seconds, current barrier epoch is 1.\n...\n

                                Open up web UI and explore the execution plans.

                                "},{"location":"barrier-execution-mode/#access-mappartitionsrdd","title":"Access MapPartitionsRDD","text":"

MapPartitionsRDD is a private[spark] class, so accessing the RDD.isBarrier method requires code to be in the org.apache.spark package.

                                Paste the following code in spark-shell / Scala REPL using :paste -raw mode.

                                package org.apache.spark\n\nobject IsBarrier {\n  import org.apache.spark.rdd.RDD\n  implicit class BypassPrivateSpark[T](rdd: RDD[T]) {\n    def isBarrier = rdd.isBarrier\n  }\n}\n
                                import org.apache.spark.IsBarrier._\nassert(barrierRdd.isBarrier)\n
                                "},{"location":"barrier-execution-mode/#examples","title":"Examples","text":"

Projects with source code worth reviewing and learning from.

                                "},{"location":"barrier-execution-mode/#synapseml","title":"SynapseML","text":"

                                SynapseML's LightGBM on Apache Spark can be configured to use Barrier Execution Mode in the following modules:

                                • synapse.ml.lightgbm.LightGBMClassifier
                                • synapse.ml.lightgbm.LightGBMRanker
                                • synapse.ml.lightgbm.LightGBMRegressor
                                "},{"location":"barrier-execution-mode/#xgboost4j","title":"XGBoost4J","text":"

                                XGBoost4J is the JVM package of xgboost (an optimized distributed gradient boosting library with machine learning algorithms for regression and classification under the Gradient Boosting framework).

                                The heart of distributed training in xgboost4j-spark (that can run distributed xgboost on Apache Spark) is XGBoost.trainDistributed.

                                There's a familiar line that creates a barrier stage (using RDD.barrier()):

                                val boostersAndMetrics = trainingRDD.barrier().mapPartitions {\n  // distributed training using XGBoost happens here\n}\n

The barrier mapPartitions block is followed by RDD.collect() that fetches the XGBoost4J-specific metadata (the booster and metrics):

                                val (booster, metrics) = boostersAndMetrics.collect()(0)\n

                                Within the barrier stage (within mapPartitions block), xgboost4j-spark builds a distributed booster:

1. Checkpointing, when enabled, is performed only by Task 0
2. All tasks initialize the so-called collective Communicator for synchronization
3. xgboost4j-spark uses XGBoostJNI to talk to XGBoost using JNI
4. Only Task 0 returns a non-empty iterator (which is why RDD.collect()(0) gets the (booster, metrics))
5. All tasks execute SXGBoost.train that eventually leads to XGBoost.trainAndSaveCheckpoint
                                "},{"location":"barrier-execution-mode/#learn-more","title":"Learn More","text":"
                                1. SPIP: Support Barrier Execution Mode in Apache Spark (esp. Design: Barrier execution mode)
                                2. Barrier Execution Mode in Spark 3.0 - Part 1 : Introduction
                                "},{"location":"barrier-execution-mode/BarrierCoordinator/","title":"Barrier Coordinator RPC Endpoint","text":"

                                BarrierCoordinator is a ThreadSafeRpcEndpoint that is registered as barrierSync RPC Endpoint when TaskSchedulerImpl is requested to maybeInitBarrierCoordinator.

                                BarrierCoordinator is responsible for handling RequestToSync messages to coordinate Global Syncs of barrier tasks (using allGather and barrier operators).

In other words, the driver sets up a BarrierCoordinator (TaskSchedulerImpl, precisely) upon startup that BarrierTaskContexts talk to using RequestToSync messages. BarrierCoordinator tracks the number of tasks to wait for until a barrier stage is complete and a response can be sent back to the tasks to continue (which otherwise stay paused for up to 365 days (!)).

                                "},{"location":"barrier-execution-mode/BarrierCoordinator/#creating-instance","title":"Creating Instance","text":"

                                BarrierCoordinator takes the following to be created:

                                • Timeout (seconds)
                                • LiveListenerBus
                                • RpcEnv

                                  BarrierCoordinator is created when:

                                  • TaskSchedulerImpl is requested to maybeInitBarrierCoordinator
                                  "},{"location":"barrier-execution-mode/BarrierCoordinator/#receiveAndReply","title":"Processing RequestToSync Messages (from Barrier Tasks)","text":"RpcEndpoint
                                  receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                  receiveAndReply is part of the RpcEndpoint abstraction.

                                  receiveAndReply handles RequestToSync messages.

                                  Unless already registered, receiveAndReply registers a new ContextBarrierId (for the stageId and the stageAttemptId) in the Barrier States registry.

                                  Multiple Tasks and One BarrierCoordinator

                                  receiveAndReply handles RequestToSync messages, one per task in a barrier stage. Out of all the properties of RequestToSync, numTasks, stageId and stageAttemptId are used.

                                  The very first RequestToSync is used to register the stageId and stageAttemptId (as ContextBarrierId) with numTasks.

                                  receiveAndReply finds the ContextBarrierState for the stage and the stage attempt (in the Barrier States registry) to handle the RequestToSync.

                                  "},{"location":"barrier-execution-mode/BarrierCoordinator/#states","title":"Barrier States","text":"
                                  states: ConcurrentHashMap[ContextBarrierId, ContextBarrierState]\n

                                  BarrierCoordinator creates an empty ConcurrentHashMap (Java) when created.

                                  states registry is used to keep track of all the active barrier stage attempts and the corresponding internal ContextBarrierState.

                                  states is used when:

                                  • onStop to clean up
                                  • cleanupBarrierStage to remove a specific stage attempt
                                  • receiveAndReply to handle RequestToSync messages
                                  "},{"location":"barrier-execution-mode/BarrierCoordinator/#listener","title":"SparkListener","text":"

                                  BarrierCoordinator creates a SparkListener when created.

                                  The SparkListener is used to intercept SparkListenerStageCompleted events.

The SparkListener is registered (addToStatusQueue) upon startup and removed at stop.

                                  "},{"location":"barrier-execution-mode/BarrierCoordinator/#onStageCompleted","title":"onStageCompleted","text":"SparkListener
                                  onStageCompleted(\n  stageCompleted: SparkListenerStageCompleted): Unit\n

                                  onStageCompleted is part of the SparkListenerInterface abstraction.

                                  onStageCompleted cleanupBarrierStage for the stage and the attempt number (based on the given SparkListenerStageCompleted).

                                  "},{"location":"barrier-execution-mode/BarrierCoordinator/#logging","title":"Logging","text":"

                                  Enable ALL logging level for org.apache.spark.BarrierCoordinator logger to see what happens inside.

                                  Add the following line to conf/log4j2.properties:

                                  logger.BarrierCoordinator.name = org.apache.spark.BarrierCoordinator\nlogger.BarrierCoordinator.level = all\n

                                  Refer to Logging.

                                  "},{"location":"barrier-execution-mode/BarrierCoordinatorMessage/","title":"BarrierCoordinatorMessage RPC Messages","text":"

                                  BarrierCoordinatorMessage is an abstraction of RPC messages that tasks can send out using BarrierTaskContext operators for BarrierCoordinator to handle.

                                  BarrierCoordinatorMessage is a Serializable (Java) (so it can be sent from executors to the driver over the wire).

                                  "},{"location":"barrier-execution-mode/BarrierCoordinatorMessage/#implementations","title":"Implementations","text":"Sealed Trait

                                  BarrierCoordinatorMessage is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                  Learn more in the Scala Language Specification.

                                  • RequestToSync
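
As a rough, hypothetical illustration of the sealed-trait pattern (simplified field list, not Spark's actual definitions):

sealed trait BarrierCoordinatorMessage extends Serializable\n// the compiler requires every implementation to live in this same source file\ncase class RequestToSync(stageId: Int, stageAttemptId: Int, numTasks: Int)\n  extends BarrierCoordinatorMessage\n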
                                  "},{"location":"barrier-execution-mode/BarrierJobAllocationFailed/","title":"BarrierJobAllocationFailed","text":"

                                  BarrierJobAllocationFailed is...FIXME

                                  "},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/","title":"BarrierJobSlotsNumberCheckFailed","text":""},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/#barrierjobslotsnumbercheckfailed","title":"BarrierJobSlotsNumberCheckFailed","text":"

                                  BarrierJobSlotsNumberCheckFailed is a BarrierJobAllocationFailed with the following exception message:

                                  [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently.\nPlease init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.\n

                                  BarrierJobSlotsNumberCheckFailed can be thrown when DAGScheduler is requested to handle a JobSubmitted event.

                                  "},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/#creating-instance","title":"Creating Instance","text":"

                                  BarrierJobSlotsNumberCheckFailed takes the following to be created:

                                  • Required Concurrent Tasks (based on the number of partitions of a barrier RDD)
                                  • Maximum Number of Concurrent Tasks (based on a ResourceProfile used)

                                    BarrierJobSlotsNumberCheckFailed is created when:

                                    • SparkCoreErrors is requested to numPartitionsGreaterThanMaxNumConcurrentTasksError
                                    "},{"location":"barrier-execution-mode/BarrierTaskContext/","title":"BarrierTaskContext \u2014 TaskContext for Barrier Tasks","text":"

                                    BarrierTaskContext is a concrete TaskContext of the tasks in a Barrier Stage in Barrier Execution Mode.

                                    "},{"location":"barrier-execution-mode/BarrierTaskContext/#creating-instance","title":"Creating Instance","text":"

                                    BarrierTaskContext takes the following to be created:

                                    • TaskContext

                                      BarrierTaskContext is created when:

                                      • Task is requested to run (with isBarrier flag enabled)
                                      "},{"location":"barrier-execution-mode/BarrierTaskContext/#barrierCoordinator","title":"Barrier Coordinator RPC Endpoint","text":"
                                      barrierCoordinator: RpcEndpointRef\n

BarrierTaskContext creates an RpcEndpointRef to the Barrier Coordinator RPC Endpoint when created.

                                      barrierCoordinator is used to handle barrier and allGather operators (through runBarrier).

                                      "},{"location":"barrier-execution-mode/BarrierTaskContext/#allGather","title":"allGather","text":"
                                      allGather(\n  message: String): Array[String]\n

                                      allGather runBarrier with the given message and ALL_GATHER request method.

                                      Public API and PySpark

                                      allGather is part of a public API.

                                      allGather is used in BasePythonRunner.WriterThread (PySpark) when requested to barrierAndServe.
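
A minimal usage sketch (assuming a SparkContext sc, as in the Demo above):

val gathered = sc\n  .parallelize(0 until 2, numSlices = 2)\n  .barrier()\n  .mapPartitions { _ =>\n    val ctx = org.apache.spark.BarrierTaskContext.get()\n    // blocks until every task in this barrier stage has called allGather\n    val messages = ctx.allGather(s\"task-${ctx.partitionId()}\")\n    Iterator.single(messages.mkString(\"|\"))\n  }\n  .collect()\n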

                                      "},{"location":"barrier-execution-mode/BarrierTaskContext/#barrier","title":"barrier","text":"
                                      barrier(): Unit\n

                                      barrier runBarrier with no message and BARRIER request method.

                                      Public API and PySpark

                                      barrier is part of a public API.

                                      barrier is used in BasePythonRunner.WriterThread (PySpark) when requested to barrierAndServe.

                                      "},{"location":"barrier-execution-mode/BarrierTaskContext/#runBarrier","title":"Global Sync","text":"
                                      runBarrier(\n  message: String,\n  requestMethod: RequestMethod.Value): Array[String]\n

                                      runBarrier prints out the following INFO message to the logs:

                                      Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) has entered the global sync, current barrier epoch is [barrierEpoch].\n

                                      runBarrier prints out the following TRACE message to the logs:

                                      Current callSite: [callSite]\n

                                      runBarrier schedules a TimerTask (Java) to print out the following INFO message to the logs every minute:

                                      Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) waiting under the global sync since [startTime],\nhas been waiting for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                      runBarrier requests the Barrier Coordinator RPC Endpoint to send a RequestToSync one-off message and waits 365 days (!) for a response (a collection of responses from all the barrier tasks).

                                      1 Year to Wait for Response from Barrier Coordinator

                                      runBarrier uses 1 year to wait until the response arrives.

                                      runBarrier checks every second if the response \"bundle\" arrived.

                                      runBarrier increments the barrierEpoch.

                                      runBarrier prints out the following INFO message to the logs:

                                      Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) finished global sync successfully,\nwaited for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                      In the end, runBarrier returns the response \"bundle\" (a collection of responses from all the barrier tasks).

                                      In case of a SparkException, runBarrier prints out the following INFO message to the logs and reports (re-throws) the exception up (the call chain):

                                      Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) failed to perform global sync,\nwaited for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                      runBarrier is used when:

                                      • BarrierTaskContext is requested to barrier, allGather
                                      "},{"location":"barrier-execution-mode/BarrierTaskContext/#logging","title":"Logging","text":"

                                      Enable ALL logging level for org.apache.spark.BarrierTaskContext logger to see what happens inside.

                                      Add the following line to conf/log4j2.properties:

                                      logger.BarrierTaskContext.name = org.apache.spark.BarrierTaskContext\nlogger.BarrierTaskContext.level = all\n

                                      Refer to Logging.

                                      "},{"location":"barrier-execution-mode/ContextBarrierState/","title":"ContextBarrierState","text":"

                                      ContextBarrierState represents the state of global sync of a barrier stage (with the number of tasks).

                                      ContextBarrierState is used by BarrierCoordinator to handle RequestToSync messages (and to keep track of active barrier stage attempts).

                                      ContextBarrierState

                                      ContextBarrierState is a private class of BarrierCoordinator.

                                      "},{"location":"barrier-execution-mode/ContextBarrierState/#creating-instance","title":"Creating Instance","text":"

                                      ContextBarrierState takes the following to be created:

                                      • ContextBarrierId
                                      • Number of Tasks (of a barrier stage)

                                        ContextBarrierState is created when:

                                        • BarrierCoordinator is requested to handle a RequestToSync message for a new stage and stage attempt IDs
                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#barrierId","title":"Barrier Stage Attempt (ContextBarrierId)","text":"

                                        ContextBarrierState is given a ContextBarrierId (of a barrier stage) when created.

                                        The ContextBarrierId uniquely identifies a barrier stage by the stage and stage attempt IDs.

                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#barrierEpoch","title":"Barrier Epoch","text":"

                                        ContextBarrierState initializes barrierEpoch counter to be 0 when created.

                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#requesters","title":"Barrier Tasks","text":"
                                        requesters: ArrayBuffer[RpcCallContext]\n

                                        requesters is a registry of RpcCallContexts of the barrier tasks (of a barrier stage attempt) pending a reply.

                                        It is only when the number of RpcCallContexts in the requesters reaches the number of tasks expected (while handling RequestToSync requests) that this ContextBarrierState is considered finished successfully.

ContextBarrierState initializes requesters, when created, with the capacity of the number of tasks.

                                        A new RpcCallContext of a barrier task is added in handleRequest only when the epoch of the barrier task matches the current barrierEpoch.

                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#timerTask","title":"TimerTask","text":"
                                        timerTask: TimerTask\n

                                        ContextBarrierState uses a TimerTask (Java) to ensure that a barrier() call can time out.

ContextBarrierState creates a TimerTask (Java) in initTimerTask while handling the first RequestToSync message of a global sync (when the requesters registry is empty). The TimerTask is then immediately scheduled to be executed after spark.barrier.sync.timeout.

                                        spark.barrier.sync.timeout

                                        Since spark.barrier.sync.timeout defaults to 365d (1 year), the TimerTask will run only after one year.

                                        The TimerTask is stopped in cancelTimerTask.
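
For local experiments, the timeout can be lowered via spark.barrier.sync.timeout (a sketch; the value format follows Spark's usual time-string configs, e.g. 300s):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.barrier.sync.timeout\", \"300s\") // default: 365d\n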

                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#initTimerTask","title":"Initializing TimerTask","text":"
                                        initTimerTask(\n  state: ContextBarrierState): Unit\n

                                        initTimerTask creates a new TimerTask (Java) that, when executed, sends a SparkException to all the requesters with the following message followed by cleanupBarrierStage for this ContextBarrierId.

                                        The coordinator didn't get all barrier sync requests\nfor barrier epoch [barrierEpoch] from [barrierId] within [timeoutInSecs] second(s).\n

                                        The TimerTask is made available as timerTask.

                                        initTimerTask is used when:

                                        • ContextBarrierState is requested to handle a RequestToSync message (for the first global sync message received when the requesters is empty)
                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#messages","title":"messages","text":"

ContextBarrierState initializes the messages registry for the messages from all numTasks barrier tasks (of a barrier stage attempt) when created.

The messages registry is initially empty.

                                        A new message is registered (added) when handling a RequestToSync request.

                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#handleRequest","title":"Handling RequestToSync Message","text":"
                                        handleRequest(\n  requester: RpcCallContext,\n  request: RequestToSync): Unit\n

                                        handleRequest makes sure that the RequestMethod (of the given RequestToSync) is consistent across barrier tasks (using requestMethods registry).

handleRequest asserts that the number of tasks (of the given RequestToSync) matches this numTasks and so is consistent across barrier tasks. Otherwise, handleRequest reports an IllegalArgumentException:

                                        Number of tasks of [barrierId] is [numTasks] from Task [taskId], previously it was [numTasks].\n

                                        handleRequest prints out the following INFO message to the logs (with the ContextBarrierId and barrierEpoch):

                                        Current barrier epoch for [barrierId] is [barrierEpoch].\n

                                        For the first sync message received (requesters is empty), handleRequest initializes the TimerTask and schedules it for execution after the timeoutInSecs.

                                        Timeout

                                        Starting the timerTask ensures that a sync may eventually time out (after a configured delay).

                                        handleRequest registers the given requester in the requesters.

                                        handleRequest registers the message of the RequestToSync in the messages for the partitionId.

                                        handleRequest prints out the following INFO message to the logs:

                                        Barrier sync epoch [barrierEpoch] from [barrierId] received update from Task taskId,\ncurrent progress: [requesters]/[numTasks].\n
                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#updates-from-all-barrier-tasks-received","title":"Updates from All Barrier Tasks Received","text":"

                                        When the barrier sync received updates from all barrier tasks (i.e., the number of requesters is the numTasks), handleRequest replies back to all the requesters with the messages.

                                        handleRequest prints out the following INFO message to the logs:

                                        Barrier sync epoch [barrierEpoch] from [barrierId] received all updates from tasks,\nfinished successfully.\n

                                        handleRequest increments the barrierEpoch, clears the requesters and the requestMethods, and then cancelTimerTask.

                                        In case of the epoch of the given RequestToSync being different from this barrierEpoch, handleRequest sends back a failure message (with a SparkException) to the given requester:

                                        The request to sync of [barrierId] with barrier epoch [barrierEpoch] has already finished.\nMaybe task [taskId] is not properly killed.\n

                                        In case of different RequestMethods (in requestMethods registry), handleRequest sends back a failure message to the requesters (incl. the given requester):

                                        Different barrier sync types found for the sync [barrierId]: [requestMethods].\nPlease use the same barrier sync type within a single sync.\n

handleRequest then clears the internal state (clear).

                                        handleRequest is used when:

                                        • BarrierCoordinator is requested to handle a RequestToSync message
                                        "},{"location":"barrier-execution-mode/ContextBarrierState/#logging","title":"Logging","text":"

                                        ContextBarrierState is a private class of BarrierCoordinator and logging is configured using the logger of BarrierCoordinator.

                                        "},{"location":"barrier-execution-mode/RDDBarrier/","title":"RDDBarrier","text":"

                                        RDDBarrier is a wrapper around RDD with two custom map transformations:

                                        • mapPartitions
                                        • mapPartitionsWithIndex

Unlike regular RDD.mapPartitions transformations, RDDBarrier transformations create a MapPartitionsRDD with the isFromBarrier flag enabled.

                                        RDDBarrier (of T records) marks the current stage as a barrier stage in Barrier Execution Mode.

                                        "},{"location":"barrier-execution-mode/RDDBarrier/#creating-instance","title":"Creating Instance","text":"

                                        RDDBarrier takes the following to be created:

                                        • RDD (of T records)

                                          RDDBarrier is created when:

                                          • RDD.barrier transformation is used
                                          "},{"location":"barrier-execution-mode/RequestMethod/","title":"RequestMethod","text":"

                                          RequestMethod represents the allowed request methods of RequestToSyncs (that are sent out from barrier tasks using BarrierTaskContext).

                                          ContextBarrierState tracks RequestMethods (from tasks inside a barrier sync) to make sure that the tasks are all part of a legitimate barrier sync. All tasks should make sure that they're calling the same method within the same barrier sync phase.

                                          "},{"location":"barrier-execution-mode/RequestMethod/#BARRIER","title":"BARRIER","text":"

                                          Marks execution of BarrierTaskContext.barrier

                                          "},{"location":"barrier-execution-mode/RequestMethod/#ALL_GATHER","title":"ALL_GATHER","text":"

                                          Marks execution of BarrierTaskContext.allGather

                                          "},{"location":"barrier-execution-mode/RequestToSync/","title":"RequestToSync RPC Message","text":"

                                          RequestToSync is a BarrierCoordinatorMessage to start Global Sync phase.

                                          RequestToSync is sent out from BarrierTaskContext (i.e., barrier tasks on executors) to a BarrierCoordinator (on the driver) to handle.

Operation | Message | Request Message
allGather | User-defined message | ALL_GATHER
barrier | empty | BARRIER
"},{"location":"barrier-execution-mode/RequestToSync/#creating-instance","title":"Creating Instance","text":"

                                          RequestToSync takes the following to be created:

                                          • Number of tasks (partitions)
                                          • Stage ID
                                          • Stage Attempt ID
                                          • Task Attempt ID
                                          • BarrierEpoch
                                          • Partition ID
                                          • Message
                                          • RequestMethod

                                            RequestToSync is created when:

                                            • BarrierTaskContext is requested for a Global Sync
                                            "},{"location":"broadcast-variables/","title":"Broadcast Variables","text":"

                                            From the official documentation about Broadcast Variables:

                                            Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

                                            And later in the document:

                                            Explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

                                            Spark uses SparkContext to create broadcast variables and BroadcastManager with ContextCleaner to manage their lifecycle.

                                            Not only can Spark developers use broadcast variables for efficient data distribution, but Spark itself uses them quite often too. A very notable use case is when Spark distributes tasks (to executors) for execution.

The idea is to transfer values used in transformations from the driver to executors in the most efficient way so they are copied once and used many times by tasks (rather than being copied every time a task is launched).

                                            "},{"location":"broadcast-variables/#lifecycle-of-broadcast-variable","title":"Lifecycle of Broadcast Variable

                                            Broadcast variables (TorrentBroadcasts, actually) are created using SparkContext.broadcast method.

                                            scala> val b = sc.broadcast(1)\nb: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)\n

                                            Tip

                                            Enable DEBUG logging level for org.apache.spark.storage.BlockManager logger to debug broadcast method.

                                            With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                            Put block broadcast_0 locally took  430 ms\nPutting block broadcast_0 without replication took  431 ms\nTold master about block broadcast_0_piece0\nPut block broadcast_0_piece0 locally took  4 ms\nPutting block broadcast_0_piece0 without replication took  4 ms\n

                                            A broadcast variable is stored on the driver's BlockManager as a single value and separately as chunks (of spark.broadcast.blockSize).
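
A configuration sketch (assuming the standard SparkConf API; 4m is the documented default chunk size):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.broadcast.blockSize\", \"4m\") // chunk size used when a broadcast value is split into pieces\n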

When requested for the broadcast value, TorrentBroadcast reads the broadcast block from the local BroadcastManager and, if that fails, from the local BlockManager. Only when the local lookups fail does TorrentBroadcast read the broadcast block chunks (from the BlockManagers on the other executors), persist them as a single broadcast variable (in the local BlockManager) and cache it in BroadcastManager.

                                            scala> b.value\nres0: Int = 1\n

Broadcast.value is the only way to access the value of a broadcast variable in a Spark transformation. The value can be accessed at any time until the broadcast variable is destroyed.

                                            With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                            Getting local block broadcast_0\nLevel for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)\n

                                            In the end, broadcast variables should be destroyed to release memory.

                                            b.destroy\n

                                            With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                            Removing broadcast 0\nRemoving block broadcast_0_piece0\nTold master about block broadcast_0_piece0\nRemoving block broadcast_0\n

                                            Broadcast variables can optionally be unpersisted.

                                            b.unpersist\n
                                            ","text":""},{"location":"broadcast-variables/#introduction","title":"Introduction

You use a broadcast variable to implement a map-side join, i.e. a join using a map. For this, lookup tables are distributed across the nodes in a cluster using broadcast and then looked up inside map transformations (to do the join implicitly).

When you broadcast a value, it is copied to executors only once (whereas it would otherwise be copied once per task). This means that broadcast can help make your Spark application faster if you have a large value used in tasks or there are more tasks than executors.

A Spark idiom has emerged that combines broadcast with collectAsMap to create a broadcast Map: map an RDD to a smaller dataset (column-wise, not record-wise), collectAsMap it, and broadcast the result. Mapping the elements of a very big RDD against such broadcast maps is computationally faster than joining RDDs.

val acMap = sc.broadcast(myRDD.map { case (a, b, c, d) => (a, c) }.collectAsMap)\nval otherMap = sc.broadcast(myOtherRDD.collectAsMap)\n\nmyBigRDD.map { case (a, b, c, d) =>\n  (acMap.value.get(a).get, otherMap.value.get(c).get)\n}.collect\n

Prefer large broadcast HashMaps over RDDs whenever possible, and use keyed RDDs to look up the necessary data as demonstrated above.

                                            ","text":""},{"location":"broadcast-variables/#demo","title":"Demo

You're going to use a static mapping of interesting projects to their websites, i.e. a Map[String, String], that the tasks (i.e. the closures (anonymous functions) in transformations) use.

                                            val pws = Map(\n  \"Apache Spark\" -> \"http://spark.apache.org/\",\n  \"Scala\" -> \"http://www.scala-lang.org/\")\n\nval websites = sc.parallelize(Seq(\"Apache Spark\", \"Scala\")).map(pws).collect\n// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)\n

It works, but is very inefficient as the pws map is sent over the wire to executors while it could have been there already. If there were more tasks that need the pws map, you could improve their performance by minimizing the number of bytes sent over the network for task execution.

                                            Enter broadcast variables.

                                            val pwsB = sc.broadcast(pws)\nval websites = sc.parallelize(Seq(\"Apache Spark\", \"Scala\")).map(pwsB.value).collect\n// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)\n

                                            Semantically, the two computations - with and without the broadcast value - are exactly the same, but the broadcast-based one wins performance-wise when there are more executors spawned to execute many tasks that use pws map.

                                            ","text":""},{"location":"broadcast-variables/#further-reading-or-watching","title":"Further Reading or Watching
                                            • Map-Side Join in Spark
                                            ","text":""},{"location":"broadcast-variables/Broadcast/","title":"Broadcast","text":"

                                            Broadcast[T] is an abstraction of broadcast variables (with the value of type T).

                                            "},{"location":"broadcast-variables/Broadcast/#contract","title":"Contract","text":""},{"location":"broadcast-variables/Broadcast/#destroying-variable","title":"Destroying Variable
                                            doDestroy(\n  blocking: Boolean): Unit\n

                                            Destroys all the data and metadata related to this broadcast variable

                                            Used when:

                                            • Broadcast is requested to destroy
                                            ","text":""},{"location":"broadcast-variables/Broadcast/#unpersisting-variable","title":"Unpersisting Variable
                                            doUnpersist(\n  blocking: Boolean): Unit\n

                                            Deletes the cached copies of this broadcast value on executors

                                            Used when:

                                            • Broadcast is requested to unpersist
                                            ","text":""},{"location":"broadcast-variables/Broadcast/#broadcast-value","title":"Broadcast Value
                                            getValue(): T\n

                                            Gets the broadcast value

                                            Used when:

                                            • Broadcast is requested for the value
                                            ","text":""},{"location":"broadcast-variables/Broadcast/#implementations","title":"Implementations","text":"
                                            • TorrentBroadcast
                                            "},{"location":"broadcast-variables/Broadcast/#creating-instance","title":"Creating Instance","text":"

                                            Broadcast takes the following to be created:

                                            • Unique Identifier Abstract Class

                                              Broadcast\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete Broadcasts.

                                              "},{"location":"broadcast-variables/Broadcast/#serializable","title":"Serializable

Broadcast is a Serializable (Java) so it can be serialized (converted to bytes) and sent over the wire from the driver to executors.

                                              ","text":""},{"location":"broadcast-variables/Broadcast/#destroying","title":"Destroying
                                              destroy(): Unit // (1)\ndestroy(\n  blocking: Boolean): Unit\n
                                              1. Non-blocking destroy (blocking is false)

                                              destroy removes persisted data and metadata associated with this broadcast variable.

                                              Note

                                              Once a broadcast variable has been destroyed, it cannot be used again.

                                              ","text":""},{"location":"broadcast-variables/Broadcast/#unpersisting","title":"Unpersisting
                                              unpersist(): Unit // (1)\nunpersist(\n  blocking: Boolean): Unit\n
                                              1. Non-blocking unpersist (blocking is false)

                                              unpersist...FIXME

                                              ","text":""},{"location":"broadcast-variables/Broadcast/#brodcast-value","title":"Brodcast Value
                                              value: T\n

                                              value makes sure that it was not destroyed and gets the value.

                                              ","text":""},{"location":"broadcast-variables/Broadcast/#text-representation","title":"Text Representation
                                              toString: String\n

                                              toString uses the id as follows:

                                              Broadcast([id])\n
                                              ","text":""},{"location":"broadcast-variables/Broadcast/#validation","title":"Validation

                                              Broadcast is considered valid until destroyed.

                                              Broadcast throws a SparkException (with the text representation) when destroyed but requested for the value, to unpersist or destroy:

                                              Attempted to use [toString] after it was destroyed ([destroySite])\n
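
For example (a sketch, assuming a SparkContext sc):

val b = sc.broadcast(Seq(1, 2, 3))\nb.destroy()\n// Any further use fails with the SparkException above, e.g.\n// b.value // org.apache.spark.SparkException: Attempted to use Broadcast(...) after it was destroyed (...)\n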
                                              ","text":""},{"location":"broadcast-variables/BroadcastFactory/","title":"BroadcastFactory","text":"

                                              BroadcastFactory is an abstraction of broadcast variable factories that BroadcastManager uses to create or delete (unbroadcast) broadcast variables.

                                              "},{"location":"broadcast-variables/BroadcastFactory/#contract","title":"Contract","text":""},{"location":"broadcast-variables/BroadcastFactory/#initialize","title":"Initializing","text":"
                                              initialize(\n  isDriver: Boolean,\n  conf: SparkConf): Unit\n
                                              Procedure

                                              initialize is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                              See:

                                              • TorrentBroadcastFactory

                                              Used when:

                                              • BroadcastManager is requested to initialize
                                              "},{"location":"broadcast-variables/BroadcastFactory/#newBroadcast","title":"Creating Broadcast Variable","text":"
                                              newBroadcast[T: ClassTag](\n  value: T,\n  isLocal: Boolean,\n  id: Long,\n  serializedOnly: Boolean = false): Broadcast[T]\n

                                              See:

                                              • TorrentBroadcastFactory

                                              Used when:

                                              • BroadcastManager is requested for a new broadcast variable
                                              "},{"location":"broadcast-variables/BroadcastFactory/#stop","title":"Stopping","text":"
                                              stop(): Unit\n
                                              Procedure

                                              stop is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                              See:

                                              • TorrentBroadcastFactory

                                              Used when:

                                              • BroadcastManager is requested to stop
                                              "},{"location":"broadcast-variables/BroadcastFactory/#unbroadcast","title":"Deleting Broadcast Variable","text":"
                                              unbroadcast(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n
                                              Procedure

                                              unbroadcast is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                              See:

                                              • TorrentBroadcastFactory

                                              Used when:

                                              • BroadcastManager is requested to delete a broadcast variable (unbroadcast)
                                              "},{"location":"broadcast-variables/BroadcastFactory/#implementations","title":"Implementations","text":"
                                              • TorrentBroadcastFactory
                                              "},{"location":"broadcast-variables/BroadcastManager/","title":"BroadcastManager","text":"

                                              BroadcastManager manages a TorrentBroadcastFactory.

                                              Note

                                              As of Spark 2.0, it is no longer possible to plug a custom BroadcastFactory in, and TorrentBroadcastFactory is the only known implementation.

                                              "},{"location":"broadcast-variables/BroadcastManager/#creating-instance","title":"Creating Instance","text":"

                                              BroadcastManager takes the following to be created:

                                              • isDriver flag
                                              • SparkConf
                                              • SecurityManager

                                                While being created, BroadcastManager is requested to initialize.

                                                BroadcastManager is created\u00a0when:

                                                • SparkEnv utility is used to create a base SparkEnv (for the driver and executors)
                                                "},{"location":"broadcast-variables/BroadcastManager/#initializing","title":"Initializing
                                                initialize(): Unit\n

                                                Unless initialized already, initialize creates a TorrentBroadcastFactory and requests it to initialize itself.

                                                ","text":""},{"location":"broadcast-variables/BroadcastManager/#torrentbroadcastfactory","title":"TorrentBroadcastFactory

                                                BroadcastManager manages a BroadcastFactory:

                                                • Creates and initializes it when created (and requested to initialize)

                                                • Stops it when stopped

                                                BroadcastManager uses the BroadcastFactory when requested for the following:

                                                • Creating a new broadcast variable
                                                • Deleting a broadcast variable
                                                ","text":""},{"location":"broadcast-variables/BroadcastManager/#creating-broadcast-variable","title":"Creating Broadcast Variable
                                                newBroadcast(\n  value_ : T,\n  isLocal: Boolean): Broadcast[T]\n

                                                newBroadcast requests the BroadcastFactory for a new broadcast variable (with the next available broadcast ID).

                                                newBroadcast\u00a0is used when:

                                                • SparkContext is requested for a new broadcast variable
                                                • MapOutputTracker utility is used to serializeMapStatuses
                                                ","text":""},{"location":"broadcast-variables/BroadcastManager/#unique-identifiers-of-broadcast-variables","title":"Unique Identifiers of Broadcast Variables

                                                BroadcastManager tracks broadcast variables and assigns unique and continuous identifiers.

                                                ","text":""},{"location":"broadcast-variables/BroadcastManager/#mapoutputtrackermaster","title":"MapOutputTrackerMaster

                                                BroadcastManager is used to create a MapOutputTrackerMaster

                                                ","text":""},{"location":"broadcast-variables/BroadcastManager/#deleting-broadcast-variable","title":"Deleting Broadcast Variable
                                                unbroadcast(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n

                                                unbroadcast requests the BroadcastFactory to delete a broadcast variable (by id).

                                                unbroadcast\u00a0is used when:

                                                • ContextCleaner is requested to clean up a broadcast variable
                                                ","text":""},{"location":"broadcast-variables/TorrentBroadcast/","title":"TorrentBroadcast","text":"

                                                TorrentBroadcast is a Broadcast that uses a BitTorrent-like protocol for broadcast blocks distribution.

                                                "},{"location":"broadcast-variables/TorrentBroadcast/#creating-instance","title":"Creating Instance","text":"

                                                TorrentBroadcast takes the following to be created:

                                                • Broadcast Value (of type T)
                                                • Identifier

                                                  TorrentBroadcast is created\u00a0when:

                                                  • TorrentBroadcastFactory is requested for a new broadcast variable
                                                  "},{"location":"broadcast-variables/TorrentBroadcast/#broadcastblockid","title":"BroadcastBlockId

                                                  TorrentBroadcast creates a BroadcastBlockId (with the id) when created

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#number-of-block-chunks","title":"Number of Block Chunks

TorrentBroadcast uses numBlocks for the number of blocks that the broadcast value was blockified into when the TorrentBroadcast was created.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#transient-lazy-broadcast-value","title":"Transient Lazy Broadcast Value
                                                  _value: T\n

                                                  TorrentBroadcast uses _value transient registry for the value that is computed on demand (and cached afterwards).

                                                  _value is a @transient private lazy val and uses the following Scala language features:

                                                  1. It is not serialized when the TorrentBroadcast is serialized to be sent over the wire to executors (and has to be re-computed afterwards)
                                                  2. It is lazily instantiated when first requested and cached afterwards
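
A generic sketch of this pattern (not TorrentBroadcast's actual code):

class OnDemandValue(id: Long) extends Serializable {\n  // skipped by Java serialization; lazily recomputed on first access after deserialization\n  @transient private lazy val cached: String = expensiveLoad(id)\n  def get: String = cached\n  private def expensiveLoad(id: Long): String = s\"value-$id\"\n}\n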
                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#value","title":"Value
                                                  getValue(): T\n

                                                  getValue uses the _value transient registry for the value if available (non-null).

                                                  Otherwise, getValue reads the broadcast block (from the local BroadcastManager, BlockManager or falls back to readBlocks).

                                                  getValue saves the object in the _value registry.

                                                  getValue\u00a0is part of the Broadcast abstraction.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#reading-broadcast-block","title":"Reading Broadcast Block
                                                  readBroadcastBlock(): T\n

                                                  readBroadcastBlock looks up the BroadcastBlockId in (the cache of) BroadcastManager and returns the value if found.

Otherwise, readBroadcastBlock executes setConf and requests the BlockManager for the locally-stored broadcast data.

                                                  If the broadcast block is found locally, readBroadcastBlock requests the BroadcastManager to cache it and returns the value.

                                                  If not found locally, readBroadcastBlock multiplies the numBlocks by the blockSize for an estimated size of the broadcast block. readBroadcastBlock prints out the following INFO message to the logs:

                                                  Started reading broadcast variable [id] with [numBlocks] pieces\n(estimated total size [estimatedTotalSize])\n

readBroadcastBlock reads the block chunks (readBlocks) and prints out the following INFO message to the logs:

                                                  Reading broadcast variable [id] took [time] ms\n

                                                  readBroadcastBlock unblockifies the block chunks into an object (using the Serializer and the CompressionCodec).

                                                  readBroadcastBlock requests the BlockManager to store the merged copy (so other tasks on this executor don't need to re-fetch it). readBroadcastBlock uses MEMORY_AND_DISK storage level and the tellMaster flag off.

                                                  readBroadcastBlock requests the BroadcastManager to cache it and returns the value.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#unblockifying-broadcast-value","title":"Unblockifying Broadcast Value
                                                  unBlockifyObject(\n  blocks: Array[InputStream],\n  serializer: Serializer,\n  compressionCodec: Option[CompressionCodec]): T\n

                                                  unBlockifyObject...FIXME

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#reading-broadcast-block-chunks","title":"Reading Broadcast Block Chunks
                                                  readBlocks(): Array[BlockData]\n

                                                  readBlocks creates a collection of BlockDatas for numBlocks block chunks.

For every block chunk (with block IDs between 0 and numBlocks, iterated over in random order), readBlocks creates a BroadcastBlockId for the id (of the broadcast variable) and the chunk (identified by the piece prefix followed by the chunk ID).

                                                  readBlocks prints out the following DEBUG message to the logs:

                                                  Reading piece [pieceId] of [broadcastId]\n

                                                  readBlocks first tries to look up the piece locally by requesting the BlockManager to getLocalBytes and, if found, stores the reference in the local block array (for the piece ID).

                                                  If not found in the local BlockManager, readBlocks requests the BlockManager to getRemoteBytes.

                                                  With checksumEnabled, readBlocks...FIXME

                                                  readBlocks requests the BlockManager to store the chunk (so other tasks on this executor don't need to re-fetch it) using MEMORY_AND_DISK_SER storage level and reporting to the driver (so other executors can pull these chunks from this executor as well).

                                                  readBlocks creates a ByteBufferBlockData for the chunk (and stores it in the blocks array).

                                                  readBlocks throws a SparkException for blocks neither available locally nor remotely:

                                                  Failed to get [pieceId] of [broadcastId]\n
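The random order in which the chunks are fetched (to spread the load across executors, BitTorrent-style) can be sketched as follows (a simplified illustration, not Spark's code):

import scala.util.Random\n\nval numBlocks = 5\n// fetch chunk IDs in random order so that executors do not all start with the same piece\nRandom.shuffle((0 until numBlocks).toList).foreach { pieceId =>\n  println(s\"Reading piece broadcast_0_piece$pieceId\")\n}\n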
                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#compressioncodec","title":"CompressionCodec
                                                  compressionCodec: Option[CompressionCodec]\n

                                                  TorrentBroadcast uses the spark.broadcast.compress configuration property for the CompressionCodec to use for writeBlocks and readBroadcastBlock.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#broadcast-block-chunk-size","title":"Broadcast Block Chunk Size

                                                  TorrentBroadcast uses the spark.broadcast.blockSize configuration property for the size of the chunks (pieces) of a broadcast block.

                                                  TorrentBroadcast uses the size for writeBlocks and readBroadcastBlock.
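For example, both properties can be set at spark-shell startup (the values below are arbitrary):

$SPARK_HOME/bin/spark-shell --conf spark.broadcast.blockSize=1m --conf spark.broadcast.compress=true\n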

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#persisting-broadcast-to-blockmanager","title":"Persisting Broadcast (to BlockManager)
                                                  writeBlocks(\n  value: T): Int\n

writeBlocks returns the number of blocks (chunks) this broadcast variable was blockified into.

The whole broadcast value is stored in the local BlockManager with the MEMORY_AND_DISK storage level, while the block chunks are stored with the MEMORY_AND_DISK_SER storage level.

                                                  writeBlocks\u00a0is used when:

                                                  • TorrentBroadcast is created (that happens on the driver only)

                                                  writeBlocks requests the BlockManager to store the given broadcast value (to be identified as the broadcastId and with the MEMORY_AND_DISK storage level).

writeBlocks blockifies the object into chunks (using the block size, the Serializer, and the optional compressionCodec).

With checksumEnabled, writeBlocks...FIXME

                                                  For every block, writeBlocks creates a BroadcastBlockId for the id and piece[index] identifier, and requests the BlockManager to store the chunk bytes (with MEMORY_AND_DISK_SER storage level and reporting to the driver).

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#blockifying-broadcast-variable","title":"Blockifying Broadcast Variable
                                                  blockifyObject(\n  obj: T,\n  blockSize: Int,\n  serializer: Serializer,\n  compressionCodec: Option[CompressionCodec]): Array[ByteBuffer]\n

                                                  blockifyObject divides (blockifies) the input obj broadcast value into blocks (ByteBuffer chunks). blockifyObject uses the given Serializer to write the value in a serialized format to a ChunkedByteBufferOutputStream of the given blockSize size with the optional CompressionCodec.
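Conceptually (a simplified sketch, not Spark's actual implementation), blockifying splits the serialized bytes into chunks of at most blockSize bytes:

import java.nio.ByteBuffer\n\ndef blockify(serialized: Array[Byte], blockSize: Int): Array[ByteBuffer] =\n  serialized.grouped(blockSize).map(chunk => ByteBuffer.wrap(chunk)).toArray\n\nblockify(Array.fill(10 * 1024 * 1024)(0: Byte), 4 * 1024 * 1024).length   // 3 chunks\n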

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#error-handling","title":"Error Handling

                                                  In case of any error, writeBlocks prints out the following ERROR message to the logs and requests the local BlockManager to remove the broadcast.

                                                  Store broadcast [broadcastId] fail, remove all pieces of the broadcast\n

                                                  In case of an error while storing the value itself, writeBlocks throws a SparkException:

                                                  Failed to store [broadcastId] in BlockManager\n

                                                  In case of an error while storing the chunks of the blockified value, writeBlocks throws a SparkException:

                                                  Failed to store [pieceId] of [broadcastId] in local BlockManager\n
                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#destroying-variable","title":"Destroying Variable
                                                  doDestroy(\n  blocking: Boolean): Unit\n

                                                  doDestroy removes the persisted state (associated with the broadcast variable) on all the nodes in a Spark application (the driver and executors).

                                                  doDestroy\u00a0is part of the Broadcast abstraction.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#unpersisting-variable","title":"Unpersisting Variable
                                                  doUnpersist(\n  blocking: Boolean): Unit\n

                                                  doUnpersist removes the persisted state (associated with the broadcast variable) on executors only.

                                                  doUnpersist\u00a0is part of the Broadcast abstraction.
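At the user-facing level these correspond to the Broadcast API, e.g. in spark-shell:

val bc = sc.broadcast((1 to 1000).toArray)\nbc.unpersist()   // removes the copies on executors only; the value is re-fetched on next use\nbc.destroy()     // removes all copies (driver and executors); the broadcast can no longer be used\n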

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#removing-persisted-state-broadcast-blocks-of-broadcast-variable","title":"Removing Persisted State (Broadcast Blocks) of Broadcast Variable
                                                  unpersist(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n

                                                  unpersist prints out the following DEBUG message to the logs:

                                                  Unpersisting TorrentBroadcast [id]\n

                                                  In the end, unpersist requests the BlockManagerMaster to remove the blocks of the given broadcast.

                                                  unpersist is used when:

                                                  • TorrentBroadcast is requested to unpersist and destroy
                                                  • TorrentBroadcastFactory is requested to unbroadcast
                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#setconf","title":"setConf
                                                  setConf(\n  conf: SparkConf): Unit\n

                                                  setConf uses the given SparkConf to initialize the compressionCodec, the blockSize and the checksumEnabled.

                                                  setConf is used when:

                                                  • TorrentBroadcast is created and re-created (when deserialized on executors)
                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#logging","title":"Logging

                                                  Enable ALL logging level for org.apache.spark.broadcast.TorrentBroadcast logger to see what happens inside.

                                                  Add the following line to conf/log4j.properties:

                                                  log4j.logger.org.apache.spark.broadcast.TorrentBroadcast=ALL\n

                                                  Refer to Logging.

                                                  ","text":""},{"location":"broadcast-variables/TorrentBroadcastFactory/","title":"TorrentBroadcastFactory","text":"

                                                  TorrentBroadcastFactory is a BroadcastFactory of TorrentBroadcasts.

                                                  Note

                                                  As of Spark 2.0 TorrentBroadcastFactory is the only known BroadcastFactory.

                                                  "},{"location":"broadcast-variables/TorrentBroadcastFactory/#creating-instance","title":"Creating Instance","text":"

                                                  TorrentBroadcastFactory takes no arguments to be created.

                                                  TorrentBroadcastFactory is created for BroadcastManager.

                                                  "},{"location":"broadcast-variables/TorrentBroadcastFactory/#newBroadcast","title":"Creating Broadcast Variable","text":"BroadcastFactory
                                                  newBroadcast[T: ClassTag](\n  value_ : T,\n  isLocal: Boolean,\n  id: Long,\n  serializedOnly: Boolean = false): Broadcast[T]\n

                                                  newBroadcast\u00a0is part of the BroadcastFactory abstraction.

                                                  newBroadcast creates a new TorrentBroadcast with the given value_ and id (and ignoring isLocal).

                                                  "},{"location":"broadcast-variables/TorrentBroadcastFactory/#unbroadcast","title":"Deleting Broadcast Variable","text":"BroadcastFactory
                                                  unbroadcast(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n

                                                  unbroadcast\u00a0is part of the BroadcastFactory abstraction.

                                                  unbroadcast removes all persisted state associated with the broadcast variable (identified by id).

                                                  "},{"location":"broadcast-variables/TorrentBroadcastFactory/#initialize","title":"Initializing","text":"BroadcastFactory
                                                  initialize(\n  isDriver: Boolean,\n  conf: SparkConf): Unit\n

                                                  initialize\u00a0is part of the BroadcastFactory abstraction.

                                                  initialize does nothing (noop).

                                                  "},{"location":"broadcast-variables/TorrentBroadcastFactory/#stop","title":"Stopping","text":"BroadcastFactory
                                                  stop(): Unit\n

                                                  stop\u00a0is part of the BroadcastFactory abstraction.

                                                  stop does nothing (noop).

                                                  "},{"location":"core/BlockFetchStarter/","title":"BlockFetchStarter","text":"

BlockFetchStarter is an abstraction used to createAndStart fetching blocks (given their block IDs and a BlockFetchingListener).

                                                  [[contract]] [[createAndStart]] [source, java]

                                                  void createAndStart(String[] blockIds, BlockFetchingListener listener) throws IOException, InterruptedException;

                                                  createAndStart is used when:

                                                  • NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is 0)

                                                  • RetryingBlockFetcher is requested to core:RetryingBlockFetcher.md#fetchAllOutstanding[fetchAllOutstanding]

                                                  "},{"location":"core/BlockFetchingListener/","title":"BlockFetchingListener","text":"

BlockFetchingListener\u00a0is an extension of the EventListener (Java) abstraction for listeners that want to be notified about block fetch successes and failures.

BlockFetchingListener is used to create a OneForOneBlockFetcher, a OneForOneBlockPusher and a RetryingBlockFetcher.

                                                  "},{"location":"core/BlockFetchingListener/#contract","title":"Contract","text":""},{"location":"core/BlockFetchingListener/#onblockfetchfailure","title":"onBlockFetchFailure
                                                  void onBlockFetchFailure(\n  String blockId,\n  Throwable exception)\n
                                                  ","text":""},{"location":"core/BlockFetchingListener/#onblockfetchsuccess","title":"onBlockFetchSuccess
                                                  void onBlockFetchSuccess(\n  String blockId,\n  ManagedBuffer data)\n
                                                  ","text":""},{"location":"core/BlockFetchingListener/#implementations","title":"Implementations","text":"
                                                  • \"Unnamed\" in ShuffleBlockFetcherIterator
                                                  • \"Unnamed\" in BlockTransferService
                                                  • RetryingBlockFetchListener
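For illustration, a minimal (anonymous) implementation of the two callbacks above could look as follows (a sketch; the package names are those of Spark's network-shuffle and network-common modules):

import org.apache.spark.network.buffer.ManagedBuffer\nimport org.apache.spark.network.shuffle.BlockFetchingListener\n\nval listener = new BlockFetchingListener {\n  override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit =\n    println(s\"Fetched $blockId (${data.size} bytes)\")\n  override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit =\n    println(s\"Failed to fetch $blockId: $exception\")\n}\n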
                                                  "},{"location":"core/CleanerListener/","title":"CleanerListener","text":"


CleanerListener is an abstraction of listeners that can be core:ContextCleaner.md#attachListener[registered with ContextCleaner] to be informed when RDDs, shuffles, broadcasts, accumulators and checkpoints are cleaned.

                                                  == [[rddCleaned]] rddCleaned Callback Method

                                                  "},{"location":"core/CleanerListener/#source-scala","title":"[source, scala]","text":"

                                                  rddCleaned( rddId: Int): Unit

                                                  rddCleaned is used when...FIXME

                                                  == [[broadcastCleaned]] broadcastCleaned Callback Method

                                                  "},{"location":"core/CleanerListener/#source-scala_1","title":"[source, scala]","text":"

                                                  broadcastCleaned( broadcastId: Long): Unit

                                                  broadcastCleaned is used when...FIXME

                                                  == [[shuffleCleaned]] shuffleCleaned Callback Method

                                                  "},{"location":"core/CleanerListener/#source-scala_2","title":"[source, scala]","text":"

                                                  shuffleCleaned( shuffleId: Int, blocking: Boolean): Unit

                                                  shuffleCleaned is used when...FIXME

                                                  == [[accumCleaned]] accumCleaned Callback Method

                                                  "},{"location":"core/CleanerListener/#source-scala_3","title":"[source, scala]","text":"

                                                  accumCleaned( accId: Long): Unit

                                                  accumCleaned is used when...FIXME

                                                  == [[checkpointCleaned]] checkpointCleaned Callback Method

                                                  "},{"location":"core/CleanerListener/#source-scala_4","title":"[source, scala]","text":"

                                                  checkpointCleaned( rddId: Long): Unit

                                                  checkpointCleaned is used when...FIXME

                                                  "},{"location":"core/ContextCleaner/","title":"ContextCleaner","text":"

ContextCleaner is a Spark service that is responsible for application-wide cleanup of RDDs, shuffles, broadcasts, accumulators and checkpoints, aimed at reducing the memory requirements of long-running data-heavy Spark applications.

                                                  "},{"location":"core/ContextCleaner/#creating-instance","title":"Creating Instance","text":"

                                                  ContextCleaner takes the following to be created:

                                                  • [[sc]] SparkContext.md[]

                                                  ContextCleaner is created and requested to start when SparkContext is created with configuration-properties.md#spark.cleaner.referenceTracking[spark.cleaner.referenceTracking] configuration property enabled.
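For reference, the cleaner-related configuration properties can be set at spark-shell startup (the values below are examples only):

$SPARK_HOME/bin/spark-shell --conf spark.cleaner.referenceTracking=true --conf spark.cleaner.referenceTracking.cleanCheckpoints=true --conf spark.cleaner.periodicGC.interval=30min\n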

                                                  == [[cleaningThread]] Spark Context Cleaner Cleaning Thread

                                                  ContextCleaner uses a daemon thread Spark Context Cleaner to clean RDD, shuffle, and broadcast states.

The Spark Context Cleaner thread is started when ContextCleaner is requested to start.

                                                  == [[listeners]][[attachListener]] CleanerListeners

ContextCleaner allows attaching core:CleanerListener.md[CleanerListeners] (using the attachListener method) to be informed when objects are cleaned.

                                                  "},{"location":"core/ContextCleaner/#sourcescala","title":"[source,scala]","text":"

                                                  attachListener( listener: CleanerListener): Unit

                                                  == [[doCleanupRDD]] doCleanupRDD Method

                                                  "},{"location":"core/ContextCleaner/#source-scala","title":"[source, scala]","text":"

                                                  doCleanupRDD( rddId: Int, blocking: Boolean): Unit

                                                  doCleanupRDD...FIXME

doCleanupRDD is used when ContextCleaner is requested to keepCleaning (to handle a CleanRDD cleanup task).

                                                  == [[keepCleaning]] keepCleaning Internal Method

                                                  "},{"location":"core/ContextCleaner/#source-scala_1","title":"[source, scala]","text":""},{"location":"core/ContextCleaner/#keepcleaning-unit","title":"keepCleaning(): Unit","text":"

keepCleaning runs indefinitely until ContextCleaner is requested to stop. keepCleaning...FIXME

                                                  keepCleaning prints out the following DEBUG message to the logs:

                                                  "},{"location":"core/ContextCleaner/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"core/ContextCleaner/#got-cleaning-task-task","title":"Got cleaning task [task]","text":"

keepCleaning is used in the Spark Context Cleaner cleaning thread that is started once when ContextCleaner is requested to start.

                                                  == [[registerRDDCheckpointDataForCleanup]] registerRDDCheckpointDataForCleanup Method

                                                  "},{"location":"core/ContextCleaner/#source-scala_2","title":"[source, scala]","text":"

registerRDDCheckpointDataForCleanup[T]( rdd: RDD[_], parentId: Int): Unit

                                                  registerRDDCheckpointDataForCleanup...FIXME

                                                  registerRDDCheckpointDataForCleanup is used when ContextCleaner is requested to <> (with configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled).

                                                  == [[registerBroadcastForCleanup]] registerBroadcastForCleanup Method

                                                  "},{"location":"core/ContextCleaner/#source-scala_3","title":"[source, scala]","text":"

registerBroadcastForCleanup[T]( broadcast: Broadcast[T]): Unit

                                                  registerBroadcastForCleanup...FIXME

                                                  registerBroadcastForCleanup is used when SparkContext is used to SparkContext.md#broadcast[create a broadcast variable].

                                                  == [[registerRDDForCleanup]] registerRDDForCleanup Method

                                                  "},{"location":"core/ContextCleaner/#source-scala_4","title":"[source, scala]","text":"

                                                  registerRDDForCleanup( rdd: RDD[_]): Unit

                                                  registerRDDForCleanup...FIXME

                                                  registerRDDForCleanup is used for rdd:RDD.md#persist[RDD.persist] operation.

                                                  == [[registerAccumulatorForCleanup]] registerAccumulatorForCleanup Method

                                                  "},{"location":"core/ContextCleaner/#source-scala_5","title":"[source, scala]","text":"

                                                  registerAccumulatorForCleanup( a: AccumulatorV2[_, _]): Unit

                                                  registerAccumulatorForCleanup...FIXME

                                                  registerAccumulatorForCleanup is used when AccumulatorV2 is requested to register.

                                                  == [[stop]] Stopping ContextCleaner

                                                  "},{"location":"core/ContextCleaner/#source-scala_6","title":"[source, scala]","text":""},{"location":"core/ContextCleaner/#stop-unit","title":"stop(): Unit","text":"

                                                  stop...FIXME

                                                  stop is used when SparkContext is requested to SparkContext.md#stop[stop].

                                                  == [[start]] Starting ContextCleaner

                                                  "},{"location":"core/ContextCleaner/#source-scala_7","title":"[source, scala]","text":""},{"location":"core/ContextCleaner/#start-unit","title":"start(): Unit","text":"

start starts the Spark Context Cleaner cleaning thread and an action to request the JVM garbage collector (using System.gc()) on a regular basis per the configuration-properties.md#spark.cleaner.periodicGC.interval[spark.cleaner.periodicGC.interval] configuration property.

The action to request the JVM GC is scheduled on the periodicGCService executor service.

                                                  start is used when SparkContext is created.

                                                  == [[periodicGCService]] periodicGCService Single-Thread Executor Service

periodicGCService is an internal single-thread ScheduledExecutorService (Java) with the name context-cleaner-periodic-gc to request the JVM garbage collector.
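A minimal sketch (not Spark's code) of such a single-thread scheduler that requests the JVM garbage collector on a regular basis:

import java.util.concurrent.{Executors, TimeUnit}\n\nval gcService = Executors.newSingleThreadScheduledExecutor()\n// request a JVM garbage collection periodically (spark.cleaner.periodicGC.interval defaults to 30 minutes)\ngcService.scheduleAtFixedRate(() => System.gc(), 30, 30, TimeUnit.MINUTES)\n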

The periodic runs are started when ContextCleaner is started and stopped when it is stopped.

                                                  == [[registerShuffleForCleanup]] Registering ShuffleDependency for Cleanup

                                                  "},{"location":"core/ContextCleaner/#source-scala_8","title":"[source, scala]","text":"

                                                  registerShuffleForCleanup( shuffleDependency: ShuffleDependency[_, _, _]): Unit

                                                  registerShuffleForCleanup registers the given ShuffleDependency for cleanup.

Internally, registerShuffleForCleanup simply executes registerForCleanup for the given ShuffleDependency.

                                                  registerShuffleForCleanup is used when ShuffleDependency is created.

                                                  == [[registerForCleanup]] Registering Object Reference For Cleanup

                                                  "},{"location":"core/ContextCleaner/#source-scala_9","title":"[source, scala]","text":"

                                                  registerForCleanup( objectForCleanup: AnyRef, task: CleanupTask): Unit

registerForCleanup adds the input objectForCleanup to the referenceBuffer internal queue.

Despite the widest-possible AnyRef type of the input objectForCleanup, what is actually buffered is a CleanupTaskWeakReference, a custom java.lang.ref.WeakReference (Java) that pairs the object with its CleanupTask.

registerForCleanup is used when ContextCleaner is requested to register RDDs, shuffle dependencies, broadcast variables, accumulators and RDD checkpoint data for cleanup.

                                                  == [[doCleanupShuffle]] Shuffle Cleanup

                                                  "},{"location":"core/ContextCleaner/#source-scala_10","title":"[source, scala]","text":"

                                                  doCleanupShuffle( shuffleId: Int, blocking: Boolean): Unit

                                                  doCleanupShuffle performs a shuffle cleanup which is to remove the shuffle from the current scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] and storage:BlockManagerMaster.md[BlockManagerMaster]. doCleanupShuffle also notifies core:CleanerListener.md[CleanerListeners].

                                                  Internally, when executed, doCleanupShuffle prints out the following DEBUG message to the logs:

                                                  "},{"location":"core/ContextCleaner/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"core/ContextCleaner/#cleaning-shuffle-id","title":"Cleaning shuffle [id]","text":"

                                                  doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#mapOutputTracker[MapOutputTracker] to scheduler:MapOutputTracker.md#unregisterShuffle[unregister the given shuffle].

                                                  doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#blockManager[BlockManagerMaster] to storage:BlockManagerMaster.md#removeShuffle[remove the shuffle blocks] (for the given shuffleId).

doCleanupShuffle informs all registered CleanerListeners that the core:CleanerListener.md#shuffleCleaned[shuffle was cleaned].

                                                  In the end, doCleanupShuffle prints out the following DEBUG message to the logs:

                                                  "},{"location":"core/ContextCleaner/#sourceplaintext_2","title":"[source,plaintext]","text":""},{"location":"core/ContextCleaner/#cleaned-shuffle-id","title":"Cleaned shuffle [id]","text":"

                                                  In case of any exception, doCleanupShuffle prints out the following ERROR message to the logs and the exception itself:

                                                  "},{"location":"core/ContextCleaner/#sourceplaintext_3","title":"[source,plaintext]","text":""},{"location":"core/ContextCleaner/#error-cleaning-shuffle-id","title":"Error cleaning shuffle [id]","text":"

doCleanupShuffle is used when ContextCleaner is requested to keepCleaning and (interestingly) while fitting an ALSModel (in Spark MLlib).

                                                  == [[logging]] Logging

                                                  Enable ALL logging level for org.apache.spark.ContextCleaner logger to see what happens inside.

                                                  Add the following line to conf/log4j.properties:

                                                  "},{"location":"core/ContextCleaner/#sourceplaintext_4","title":"[source,plaintext]","text":""},{"location":"core/ContextCleaner/#log4jloggerorgapachesparkcontextcleanerall","title":"log4j.logger.org.apache.spark.ContextCleaner=ALL","text":"

                                                  Refer to spark-logging.md[Logging].

                                                  == [[internal-properties]] Internal Properties

                                                  === [[referenceBuffer]] referenceBuffer

                                                  === [[referenceQueue]] referenceQueue

                                                  "},{"location":"core/InMemoryStore/","title":"InMemoryStore","text":"

                                                  InMemoryStore is a KVStore.

                                                  "},{"location":"core/InMemoryStore/#creating-instance","title":"Creating Instance","text":"

                                                  InMemoryStore takes no arguments to be created.

                                                  InMemoryStore is created when:

                                                  • FsHistoryProvider is created and requested to createInMemoryStore
                                                  • AppStatusStore utility is used to create an AppStatusStore for a live Spark application
                                                  "},{"location":"core/KVStore/","title":"KVStore","text":"

                                                  KVStore is an abstraction of key-value stores.

                                                  KVStore is a Java Closeable.

                                                  "},{"location":"core/KVStore/#contract","title":"Contract","text":""},{"location":"core/KVStore/#count","title":"count
                                                  long count(\n  Class<?> type)\nlong count(\n  Class<?> type,\n  String index,\n  Object indexedValue)\n
                                                  ","text":""},{"location":"core/KVStore/#delete","title":"delete
                                                  void delete(\n  Class<?> type,\n  Object naturalKey)\n
                                                  ","text":""},{"location":"core/KVStore/#getmetadata","title":"getMetadata
                                                  <T> T getMetadata(\n  Class<T> klass)\n
                                                  ","text":""},{"location":"core/KVStore/#read","title":"read
                                                  <T> T read(\n  Class<T> klass,\n  Object naturalKey)\n
                                                  ","text":""},{"location":"core/KVStore/#removeallbyindexvalues","title":"removeAllByIndexValues
                                                  <T> boolean removeAllByIndexValues(\n  Class<T> klass,\n  String index,\n  Collection<?> indexValues)\n
                                                  ","text":""},{"location":"core/KVStore/#setmetadata","title":"setMetadata
                                                  void setMetadata(\n  Object value)\n
                                                  ","text":""},{"location":"core/KVStore/#view","title":"view
                                                  <T> KVStoreView<T> view(\n  Class<T> type)\n

                                                  KVStoreView over entities of the given type

                                                  ","text":""},{"location":"core/KVStore/#write","title":"write
                                                  void write(\n  Object value)\n
                                                  ","text":""},{"location":"core/KVStore/#implementations","title":"Implementations","text":"
                                                  • ElementTrackingStore
                                                  • InMemoryStore
                                                  • LevelDB
                                                  "},{"location":"core/LevelDB/","title":"LevelDB","text":"

                                                  LevelDB is a KVStore for FsHistoryProvider.

                                                  "},{"location":"core/LevelDB/#creating-instance","title":"Creating Instance","text":"

                                                  LevelDB takes the following to be created:

                                                  • Path
                                                  • KVStoreSerializer

                                                    LevelDB is created\u00a0when:

                                                    • KVUtils utility is used to open (a LevelDB store)
                                                    "},{"location":"core/RetryingBlockFetcher/","title":"RetryingBlockFetcher","text":"

                                                    RetryingBlockFetcher is...FIXME

RetryingBlockFetcher is created and immediately started when:

                                                    • NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0 which it is by default)

RetryingBlockFetcher uses a BlockFetchStarter to core:BlockFetchStarter.md#createAndStart[createAndStart] when requested to start and later initiateRetry.

[[outstandingBlocksIds]] RetryingBlockFetcher uses the outstandingBlocksIds internal registry of outstanding block IDs to fetch, initialized with the block IDs to fetch when RetryingBlockFetcher is created.

At initiateRetry, RetryingBlockFetcher prints out the following INFO message to the logs (with the number of outstanding block IDs):

Retrying fetch ([retryCount]/[maxRetries]) for [size] outstanding blocks after [retryWaitTime] ms\n

On onBlockFetchSuccess and onBlockFetchFailure, RetryingBlockFetchListener removes the block ID from outstandingBlocksIds.

[[currentListener]] RetryingBlockFetcher uses a RetryingBlockFetchListener to remove block IDs from the outstandingBlocksIds internal registry.

                                                    == [[creating-instance]] Creating RetryingBlockFetcher Instance

                                                    RetryingBlockFetcher takes the following when created:

                                                    • [[conf]] network:TransportConf.md[]
                                                    • [[fetchStarter]] core:BlockFetchStarter.md[]
                                                    • [[blockIds]] Block IDs to fetch
                                                    • [[listener]] core:BlockFetchingListener.md[]

                                                    == [[start]] Starting RetryingBlockFetcher -- start Method

                                                    "},{"location":"core/RetryingBlockFetcher/#source-java","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#void-start","title":"void start()","text":"

start simply fetches all the outstanding blocks (fetchAllOutstanding).

                                                    start is used when:

                                                    • NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0 which it is by default)

                                                    == [[initiateRetry]] initiateRetry Internal Method

                                                    "},{"location":"core/RetryingBlockFetcher/#source-java_1","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#synchronized-void-initiateretry","title":"synchronized void initiateRetry()","text":"

                                                    initiateRetry...FIXME

                                                    "},{"location":"core/RetryingBlockFetcher/#note","title":"[NOTE]","text":"

                                                    initiateRetry is used when:

                                                    • RetryingBlockFetcher is requested to <>"},{"location":"core/RetryingBlockFetcher/#retryingblockfetchlistener-is-requested-to","title":"* RetryingBlockFetchListener is requested to <>

                                                      == [[fetchAllOutstanding]] fetchAllOutstanding Internal Method

                                                      ","text":""},{"location":"core/RetryingBlockFetcher/#source-java_2","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#void-fetchalloutstanding","title":"void fetchAllOutstanding()","text":"

fetchAllOutstanding requests the BlockFetchStarter to core:BlockFetchStarter.md#createAndStart[createAndStart] for the outstanding block IDs (outstandingBlocksIds).

NOTE: fetchAllOutstanding is used when RetryingBlockFetcher is requested to start and initiateRetry.

                                                      == [[RetryingBlockFetchListener]] RetryingBlockFetchListener

RetryingBlockFetchListener is a core:BlockFetchingListener.md[BlockFetchingListener] that RetryingBlockFetcher uses to remove block IDs from the outstandingBlocksIds internal registry.

                                                      === [[RetryingBlockFetchListener-onBlockFetchSuccess]] onBlockFetchSuccess Method

                                                      "},{"location":"core/RetryingBlockFetcher/#source-scala","title":"[source, scala]","text":""},{"location":"core/RetryingBlockFetcher/#void-onblockfetchsuccessstring-blockid-managedbuffer-data","title":"void onBlockFetchSuccess(String blockId, ManagedBuffer data)","text":"

                                                      NOTE: onBlockFetchSuccess is part of core:BlockFetchingListener.md#onBlockFetchSuccess[BlockFetchingListener Contract].

                                                      onBlockFetchSuccess...FIXME

                                                      === [[RetryingBlockFetchListener-onBlockFetchFailure]] onBlockFetchFailure Method

                                                      "},{"location":"core/RetryingBlockFetcher/#source-scala_1","title":"[source, scala]","text":""},{"location":"core/RetryingBlockFetcher/#void-onblockfetchfailurestring-blockid-throwable-exception","title":"void onBlockFetchFailure(String blockId, Throwable exception)","text":"

                                                      NOTE: onBlockFetchFailure is part of core:BlockFetchingListener.md#onBlockFetchFailure[BlockFetchingListener Contract].

                                                      onBlockFetchFailure...FIXME

                                                      "},{"location":"demo/","title":"Demos","text":"

                                                      The following demos are available:

                                                      • DiskBlockManager and Block Data
                                                      "},{"location":"demo/diskblockmanager-and-block-data/","title":"Demo: DiskBlockManager and Block Data","text":"

                                                      The demo shows how Spark stores data blocks on local disk (using DiskBlockManager and DiskStore among the services).

                                                      "},{"location":"demo/diskblockmanager-and-block-data/#configure-local-directories","title":"Configure Local Directories","text":"

                                                      Spark uses spark.local.dir configuration property for one or more local directories to store data blocks.

                                                      Start spark-shell with the property set to a directory of your choice (say local-dirs). Use one directory for easier monitoring.

                                                      $SPARK_HOME/bin/spark-shell --conf spark.local.dir=local-dirs\n

When started, Spark will create a proper directory layout. You are interested in the blockmgr-[uuid] directory.

                                                      "},{"location":"demo/diskblockmanager-and-block-data/#create-data-blocks","title":"\"Create\" Data Blocks","text":"

                                                      Execute the following Spark application that forces persisting (caching) data to disk.

                                                      import org.apache.spark.storage.StorageLevel\nspark.range(2).persist(StorageLevel.DISK_ONLY).count\n
                                                      "},{"location":"demo/diskblockmanager-and-block-data/#observe-block-files","title":"Observe Block Files","text":""},{"location":"demo/diskblockmanager-and-block-data/#command-line","title":"Command Line","text":"

                                                      Go to the blockmgr-[uuid] directory and observe the block files. There should be a few. Do you know how many and why?

                                                      $ tree local-dirs/blockmgr-b7167b5a-ae8d-404b-8de2-1a0fb101fe00/\nlocal-dirs/blockmgr-b7167b5a-ae8d-404b-8de2-1a0fb101fe00/\n\u251c\u2500\u2500 00\n\u251c\u2500\u2500 04\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_8_0.data\n\u251c\u2500\u2500 06\n\u251c\u2500\u2500 08\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_8_0.index\n...\n\u251c\u2500\u2500 37\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_7_0.index\n\u251c\u2500\u2500 38\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_4_0.data\n\u251c\u2500\u2500 39\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_9_0.index\n\u2514\u2500\u2500 3a\n    \u2514\u2500\u2500 shuffle_0_6_0.data\n\n47 directories, 48 files\n
                                                      "},{"location":"demo/diskblockmanager-and-block-data/#diskblockmanager","title":"DiskBlockManager","text":"

The files are managed by DiskBlockManager, which can also be used to access all the block files.

                                                      import org.apache.spark.SparkEnv\nSparkEnv.get.blockManager.diskBlockManager.getAllFiles()\n
                                                      "},{"location":"demo/diskblockmanager-and-block-data/#use-web-ui","title":"Use web UI","text":"

                                                      Open http://localhost:4040 and switch to Storage tab (at http://localhost:4040/storage/). You should see one RDD cached.

                                                      Click the link in RDD Name column and review the information.

                                                      "},{"location":"demo/diskblockmanager-and-block-data/#enable-logging","title":"Enable Logging","text":"

                                                      Enable ALL logging level for org.apache.spark.storage.DiskStore and org.apache.spark.storage.DiskBlockManager loggers to have an even deeper insight on the block storage internals.

                                                      log4j.logger.org.apache.spark.storage.DiskBlockManager=ALL\nlog4j.logger.org.apache.spark.storage.DiskStore=ALL\n
                                                      "},{"location":"dynamic-allocation/","title":"Dynamic Allocation of Executors","text":"

                                                      Dynamic Allocation of Executors (Dynamic Resource Allocation or Elastic Scaling) is a Spark service for adding and removing Spark executors dynamically on demand to match workload.

                                                      Unlike the \"traditional\" static allocation where a Spark application reserves CPU and memory resources upfront (irrespective of how much it may eventually use), in dynamic allocation you get as much as needed and no more. It scales the number of executors up and down based on workload, i.e. idle executors are removed, and when there are pending tasks waiting for executors to be launched on, dynamic allocation requests them.

                                                      Dynamic Allocation is enabled (and SparkContext creates an ExecutorAllocationManager) when:

                                                      1. spark.dynamicAllocation.enabled configuration property is enabled

                                                      2. spark.master is non-local

                                                      3. SchedulerBackend is an ExecutorAllocationClient

                                                      ExecutorAllocationManager is the heart of Dynamic Resource Allocation.

                                                      When enabled, it is recommended to use the External Shuffle Service.
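For example (the master and the values are arbitrary), dynamic allocation together with the external shuffle service can be enabled at startup:

$SPARK_HOME/bin/spark-shell --master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=10 --conf spark.dynamicAllocation.executorIdleTimeout=60s\n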

                                                      Dynamic Allocation comes with the policy of scaling executors up and down as follows:

1. Scale Up Policy requests new executors when there are pending tasks and increases the number of executors exponentially, since executors start slowly and a Spark application may need slightly more than initially requested.
                                                      2. Scale Down Policy removes executors that have been idle for spark.dynamicAllocation.executorIdleTimeout seconds.
                                                      "},{"location":"dynamic-allocation/#performance-metrics","title":"Performance Metrics","text":"

                                                      ExecutorAllocationManagerSource metric source is used to report performance metrics.

                                                      "},{"location":"dynamic-allocation/#sparkcontextkillexecutors","title":"SparkContext.killExecutors","text":"

                                                      SparkContext.killExecutors is unsupported with Dynamic Allocation enabled.

                                                      "},{"location":"dynamic-allocation/#programmable-dynamic-allocation","title":"Programmable Dynamic Allocation","text":"

                                                      SparkContext offers a developer API to scale executors up or down.
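For illustration, in spark-shell (with a coarse-grained scheduler backend, e.g. YARN or Spark Standalone; the executor IDs are arbitrary):

sc.requestExecutors(2)             // ask the cluster manager for 2 additional executors\nsc.killExecutors(Seq(\"1\", \"2\"))    // ask the cluster manager to kill executors with IDs 1 and 2\n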

                                                      "},{"location":"dynamic-allocation/#getting-initial-number-of-executors-for-dynamic-allocation","title":"Getting Initial Number of Executors for Dynamic Allocation
                                                      getDynamicAllocationInitialExecutors(conf: SparkConf): Int\n

getDynamicAllocationInitialExecutors first makes sure that spark.dynamicAllocation.initialExecutors is equal to or greater than spark.dynamicAllocation.minExecutors.

NOTE: spark.dynamicAllocation.initialExecutors falls back to spark.dynamicAllocation.minExecutors if not set. Why print the WARN message to the logs then?

                                                      If not, you should see the following WARN message in the logs:

                                                      spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.\n

getDynamicAllocationInitialExecutors makes sure that executor:Executor.md#spark.executor.instances[spark.executor.instances] is equal to or greater than spark.dynamicAllocation.minExecutors.

NOTE: Both executor:Executor.md#spark.executor.instances[spark.executor.instances] and spark.dynamicAllocation.initialExecutors fall back to 0 when not defined explicitly.

                                                      If not, you should see the following WARN message in the logs:

                                                      spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.\n

                                                      getDynamicAllocationInitialExecutors sets the initial number of executors to be the maximum of:

                                                      • spark.dynamicAllocation.minExecutors
                                                      • spark.dynamicAllocation.initialExecutors
                                                      • spark.executor.instances
                                                      • 0
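For example, with the following (assumed) settings the initial number of executors is 3:

// assumed settings (not defaults):\n// spark.dynamicAllocation.minExecutors     = 2\n// spark.dynamicAllocation.initialExecutors = 1   (less than minExecutors, so the WARN above is printed)\n// spark.executor.instances                 = 3\nSeq(2, 1, 3, 0).max   // 3\n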

                                                      You should see the following INFO message in the logs:

                                                      Using initial executors = [initialExecutors], max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances\n

                                                      getDynamicAllocationInitialExecutors is used when ExecutorAllocationManager is requested to set the initial number of executors.

                                                      ","text":""},{"location":"dynamic-allocation/#resources","title":"Resources","text":""},{"location":"dynamic-allocation/#documentation","title":"Documentation","text":"
                                                      • Dynamic Allocation in the official documentation of Apache Spark
                                                      • Dynamic allocation in the documentation of Cloudera Data Platform (CDP)
                                                      "},{"location":"dynamic-allocation/#slides","title":"Slides","text":"
                                                      • Dynamic Allocation in Spark by Databricks
                                                      "},{"location":"dynamic-allocation/ExecutorAllocationClient/","title":"ExecutorAllocationClient","text":"

                                                      ExecutorAllocationClient is an abstraction of schedulers that can communicate with a cluster manager to request or kill executors.

                                                      "},{"location":"dynamic-allocation/ExecutorAllocationClient/#contract","title":"Contract","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#active-executor-ids","title":"Active Executor IDs
                                                      getExecutorIds(): Seq[String]\n

                                                      Used when:

                                                      • SparkContext is requested for active executors
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#isexecutoractive","title":"isExecutorActive
                                                      isExecutorActive(\n  id: String): Boolean\n

                                                      Whether a given executor (by ID) is active (and can be used to execute tasks)

                                                      Used when:

                                                      • FIXME
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-executors","title":"Killing Executors
                                                      killExecutors(\n  executorIds: Seq[String],\n  adjustTargetNumExecutors: Boolean,\n  countFailures: Boolean,\n  force: Boolean = false): Seq[String]\n

                                                      Requests a cluster manager to kill given executors and returns whether the request has been acknowledged by the cluster manager (true) or not (false).

                                                      Used when:

                                                      • ExecutorAllocationClient is requested to kill an executor
                                                      • ExecutorAllocationManager is requested to removeExecutors
                                                      • SparkContext is requested to kill executors and killAndReplaceExecutor
                                                      • BlacklistTracker is requested to kill an executor
                                                      • DriverEndpoint is requested to handle a KillExecutorsOnHost message
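
For example, such a request can be issued through SparkContext's developer API (a sketch; the executor IDs below are made up):

// assumes an active SparkContext sc with Dynamic Allocation enabled\nval acknowledged = sc.killExecutors(Seq(\"1\", \"2\"))\n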
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-executors-on-host","title":"Killing Executors on Host
                                                      killExecutorsOnHost(\n  host: String): Boolean\n

                                                      Used when:

                                                      • BlacklistTracker is requested to kill executors on a blacklisted node
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#requesting-additional-executors","title":"Requesting Additional Executors
                                                      requestExecutors(\n  numAdditionalExecutors: Int): Boolean\n

                                                      Requests additional executors from a cluster manager and returns whether the request has been acknowledged by the cluster manager (true) or not (false).

                                                      Used when:

                                                      • SparkContext is requested for additional executors
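
For example (a sketch assuming an active SparkContext sc):

// ask the cluster manager for two more executors\nval acknowledged = sc.requestExecutors(2)\n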
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#updating-total-executors","title":"Updating Total Executors
                                                      requestTotalExecutors(\n  resourceProfileIdToNumExecutors: Map[Int, Int],\n  numLocalityAwareTasksPerResourceProfileId: Map[Int, Int],\n  hostToLocalTaskCount: Map[Int, Map[String, Int]]): Boolean\n

                                                      Updates a cluster manager with the exact number of executors desired. Returns whether the request has been acknowledged by the cluster manager (true) or not (false).

                                                      Used when:

                                                      • SparkContext is requested to update the number of total executors

                                                      • ExecutorAllocationManager is requested to start, updateAndSyncNumExecutorsTarget, addExecutors, removeExecutors
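
All three arguments are keyed by resource profile ID (0 is the default profile). A sketch of their shapes (the numbers and host names are made up):

val resourceProfileIdToNumExecutors = Map(0 -> 10)\nval numLocalityAwareTasksPerResourceProfileId = Map(0 -> 4)\nval hostToLocalTaskCount = Map(0 -> Map(\"host1\" -> 3, \"host2\" -> 1))\n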

                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#implementations","title":"Implementations","text":"
                                                      • CoarseGrainedSchedulerBackend
                                                      • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                      • MesosCoarseGrainedSchedulerBackend
• StandaloneSchedulerBackend (Spark Standalone, https://books.japila.pl/spark-standalone-internals/StandaloneSchedulerBackend)
                                                      • YarnSchedulerBackend
                                                      "},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-single-executor","title":"Killing Single Executor
                                                      killExecutor(\n  executorId: String): Boolean\n

killExecutor kills the given executor.

                                                      killExecutor\u00a0is used when:

                                                      • ExecutorAllocationManager removes an executor.
                                                      • SparkContext is requested to kill executors.
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#decommissioning-executors","title":"Decommissioning Executors
                                                      decommissionExecutors(\n  executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],\n  adjustTargetNumExecutors: Boolean,\n  triggeredByExecutor: Boolean): Seq[String]\n

                                                      decommissionExecutors kills the given executors.

                                                      decommissionExecutors\u00a0is used when:

                                                      • ExecutorAllocationClient is requested to decommission a single executor
                                                      • ExecutorAllocationManager is requested to remove executors
                                                      • StandaloneSchedulerBackend (Spark Standalone) is requested to executorDecommissioned
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#decommissioning-single-executor","title":"Decommissioning Single Executor
                                                      decommissionExecutor(\n  executorId: String,\n  decommissionInfo: ExecutorDecommissionInfo,\n  adjustTargetNumExecutors: Boolean,\n  triggeredByExecutor: Boolean = false): Boolean\n

                                                      decommissionExecutor...FIXME

                                                      decommissionExecutor\u00a0is used when:

                                                      • DriverEndpoint is requested to handle a ExecutorDecommissioning message
                                                      ","text":""},{"location":"dynamic-allocation/ExecutorAllocationListener/","title":"ExecutorAllocationListener","text":"

ExecutorAllocationListener is a SparkListener that intercepts events about stages, tasks, and executors, i.e. onStageSubmitted, onStageCompleted, onTaskStart, onTaskEnd, onExecutorAdded, and onExecutorRemoved. Using the events ExecutorAllocationManager can manage the pool of dynamically managed executors.

                                                      Internal Class

                                                      ExecutorAllocationListener is an internal class of ExecutorAllocationManager with full access to internal registries.

                                                      "},{"location":"dynamic-allocation/ExecutorAllocationManager/","title":"ExecutorAllocationManager","text":"

                                                      ExecutorAllocationManager can be used to dynamically allocate executors based on processing workload.

                                                      ExecutorAllocationManager intercepts Spark events using the internal ExecutorAllocationListener that keeps track of the workload.

                                                      "},{"location":"dynamic-allocation/ExecutorAllocationManager/#creating-instance","title":"Creating Instance","text":"

                                                      ExecutorAllocationManager takes the following to be created:

                                                      • ExecutorAllocationClient
                                                      • LiveListenerBus
                                                      • SparkConf
                                                      • ContextCleaner (default: None)
                                                      • Clock (default: SystemClock)

                                                        ExecutorAllocationManager is created (and started) when SparkContext is created (with Dynamic Allocation of Executors enabled)

                                                        "},{"location":"dynamic-allocation/ExecutorAllocationManager/#validating-configuration","title":"Validating Configuration
                                                        validateSettings(): Unit\n

                                                        validateSettings makes sure that the settings for dynamic allocation are correct.

                                                        validateSettings throws a SparkException when the following are not met:

                                                        • spark.dynamicAllocation.minExecutors must be positive

                                                        • spark.dynamicAllocation.maxExecutors must be 0 or greater

                                                        • spark.dynamicAllocation.minExecutors must be less than or equal to spark.dynamicAllocation.maxExecutors

                                                        • spark.dynamicAllocation.executorIdleTimeout must be greater than 0

                                                        • spark.shuffle.service.enabled must be enabled.

• The number of tasks per executor, i.e. spark.executor.cores divided by spark.task.cpus, is not zero.
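
A configuration that passes all of the above checks could look as follows (an illustrative sketch with made-up values):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.dynamicAllocation.enabled\", \"true\")\n  .set(\"spark.shuffle.service.enabled\", \"true\")\n  .set(\"spark.dynamicAllocation.minExecutors\", \"1\")\n  .set(\"spark.dynamicAllocation.maxExecutors\", \"10\")\n  .set(\"spark.dynamicAllocation.executorIdleTimeout\", \"60s\")\n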

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#performance-metrics","title":"Performance Metrics","text":"

                                                        ExecutorAllocationManager uses ExecutorAllocationManagerSource for performance metrics.

                                                        "},{"location":"dynamic-allocation/ExecutorAllocationManager/#executormonitor","title":"ExecutorMonitor

                                                        ExecutorAllocationManager creates an ExecutorMonitor when created.

                                                        ExecutorMonitor is added to the management queue (of LiveListenerBus) when ExecutorAllocationManager is started.

                                                        ExecutorMonitor is attached (to the ContextCleaner) when ExecutorAllocationManager is started.

                                                        ExecutorMonitor is requested to reset when ExecutorAllocationManager is requested to reset.

                                                        ExecutorMonitor is used for the performance metrics:

                                                        • numberExecutorsPendingToRemove (based on pendingRemovalCount)
                                                        • numberAllExecutors (based on executorCount)

                                                        ExecutorMonitor is used for the following:

                                                        • timedOutExecutors when ExecutorAllocationManager is requested to schedule
                                                        • executorCount when ExecutorAllocationManager is requested to addExecutors
                                                        • executorCount, pendingRemovalCount and executorsKilled when ExecutorAllocationManager is requested to removeExecutors
                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#executorallocationlistener","title":"ExecutorAllocationListener

                                                        ExecutorAllocationManager creates an ExecutorAllocationListener when created to intercept Spark events that impact the allocation policy.

                                                        ExecutorAllocationListener is added to the management queue (of LiveListenerBus) when ExecutorAllocationManager is started.

                                                        ExecutorAllocationListener is used to calculate the maximum number of executors needed.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#sparkdynamicallocationexecutorallocationratio","title":"spark.dynamicAllocation.executorAllocationRatio

                                                        ExecutorAllocationManager uses spark.dynamicAllocation.executorAllocationRatio configuration property for maxNumExecutorsNeeded.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#tasksperexecutorforfullparallelism","title":"tasksPerExecutorForFullParallelism

                                                        ExecutorAllocationManager uses spark.executor.cores and spark.task.cpus configuration properties for the number of tasks that can be submitted to an executor for full parallelism.

                                                        Used when:

                                                        • maxNumExecutorsNeeded
                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#maximum-number-of-executors-needed","title":"Maximum Number of Executors Needed
                                                        maxNumExecutorsNeeded(): Int\n

                                                        maxNumExecutorsNeeded requests the ExecutorAllocationListener for the number of pending and running tasks.

maxNumExecutorsNeeded is the total number of pending and running tasks multiplied by executorAllocationRatio, divided by tasksPerExecutorForFullParallelism, and rounded up to the nearest integer.
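
Expressed as code (a sketch of the formula only, with placeholder values; not the actual implementation):

// placeholder values: 10 pending + 6 running tasks, ratio 1.0, 4 tasks per executor\nval (pendingTasks, runningTasks) = (10, 6)\nval executorAllocationRatio = 1.0\nval tasksPerExecutorForFullParallelism = 4\nval maxNeeded = math.ceil(\n  (pendingTasks + runningTasks) * executorAllocationRatio /\n    tasksPerExecutorForFullParallelism).toInt  // 4\n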

                                                        maxNumExecutorsNeeded is used for:

                                                        • updateAndSyncNumExecutorsTarget
                                                        • numberMaxNeededExecutors performance metric
                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#executorallocationclient","title":"ExecutorAllocationClient

                                                        ExecutorAllocationManager is given an ExecutorAllocationClient when created.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#starting-executorallocationmanager","title":"Starting ExecutorAllocationManager
                                                        start(): Unit\n

                                                        start requests the LiveListenerBus to add to the management queue:

                                                        • ExecutorAllocationListener
                                                        • ExecutorMonitor

                                                        start requests the ContextCleaner (if defined) to attach the ExecutorMonitor.

start creates a scheduleTask (a Java Runnable) that invokes schedule.

                                                        start requests the ScheduledExecutorService to schedule the scheduleTask every 100 ms.

                                                        Note

The 100 ms schedule interval is not configurable.

                                                        start requests the ExecutorAllocationClient to request the total executors with the following:

                                                        • numExecutorsTarget
                                                        • localityAwareTasks
                                                        • hostToLocalTaskCount

                                                        start is used when SparkContext is created.
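
The periodic scheduling can be pictured as follows (a simplified sketch; the ScheduledExecutorService stand-in and its creation are assumptions, schedule refers to the method described below, and the 100 ms interval is hard-coded):

import java.util.concurrent.{Executors, TimeUnit}\n\nval executor = Executors.newSingleThreadScheduledExecutor()  // stand-in for the allocation executor\nval scheduleTask = new Runnable() {\n  override def run(): Unit = schedule()\n}\nexecutor.scheduleWithFixedDelay(scheduleTask, 0, 100, TimeUnit.MILLISECONDS)\n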

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#scheduling-executors","title":"Scheduling Executors
                                                        schedule(): Unit\n

                                                        schedule requests the ExecutorMonitor for timedOutExecutors.

                                                        If there are executors to be removed, schedule turns the initializing internal flag off.

                                                        schedule updateAndSyncNumExecutorsTarget with the current time.

                                                        In the end, schedule removes the executors to be removed if there are any.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#updateandsyncnumexecutorstarget","title":"updateAndSyncNumExecutorsTarget
                                                        updateAndSyncNumExecutorsTarget(\n  now: Long): Int\n

updateAndSyncNumExecutorsTarget computes the maximum number of executors needed (maxNumExecutorsNeeded).

                                                        updateAndSyncNumExecutorsTarget...FIXME

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#stopping-executorallocationmanager","title":"Stopping ExecutorAllocationManager
                                                        stop(): Unit\n

stop shuts down the spark-dynamic-executor-allocation allocation executor.

                                                        Note

                                                        stop waits 10 seconds for the termination to be complete.

                                                        stop is used when SparkContext is requested to stop

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#spark-dynamic-executor-allocation-allocation-executor","title":"spark-dynamic-executor-allocation Allocation Executor

                                                        spark-dynamic-executor-allocation allocation executor is a...FIXME

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#executorallocationmanagersource","title":"ExecutorAllocationManagerSource

                                                        ExecutorAllocationManagerSource

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#removing-executors","title":"Removing Executors
                                                        removeExecutors(\n  executors: Seq[(String, Int)]): Seq[String]\n

                                                        removeExecutors...FIXME

                                                        removeExecutors\u00a0is used when:

                                                        • ExecutorAllocationManager is requested to schedule executors
                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#logging","title":"Logging

                                                        Enable ALL logging level for org.apache.spark.ExecutorAllocationManager logger to see what happens inside.

                                                        Add the following line to conf/log4j.properties:

                                                        log4j.logger.org.apache.spark.ExecutorAllocationManager=ALL\n

                                                        Refer to Logging.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/","title":"ExecutorAllocationManagerSource","text":"

                                                        ExecutorAllocationManagerSource is a metric source for Dynamic Allocation of Executors.

                                                        "},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#source-name","title":"Source Name

                                                        ExecutorAllocationManagerSource is registered under the name ExecutorAllocationManager.

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#gauges","title":"Gauges","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberexecutorstoadd","title":"numberExecutorsToAdd

                                                        executors/numberExecutorsToAdd for numExecutorsToAdd

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberexecutorspendingtoremove","title":"numberExecutorsPendingToRemove

                                                        executors/numberExecutorsPendingToRemove for pendingRemovalCount

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberallexecutors","title":"numberAllExecutors

                                                        executors/numberAllExecutors for executorCount

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numbertargetexecutors","title":"numberTargetExecutors

                                                        executors/numberTargetExecutors for numExecutorsTarget

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numbermaxneededexecutors","title":"numberMaxNeededExecutors

                                                        executors/numberMaxNeededExecutors for maxNumExecutorsNeeded
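
Gauges like these are typically registered with the Dropwizard (Codahale) MetricRegistry along the following lines (an illustrative sketch that assumes the enclosing Source's metricRegistry and the maxNumExecutorsNeeded method; not the actual source):

import com.codahale.metrics.{Gauge, MetricRegistry}\n\nmetricRegistry.register(\n  MetricRegistry.name(\"executors\", \"numberMaxNeededExecutors\"),\n  new Gauge[Int] {\n    override def getValue: Int = maxNumExecutorsNeeded()\n  })\n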

                                                        ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/","title":"ExecutorMonitor","text":"

                                                        ExecutorMonitor is a SparkListener and a CleanerListener.

                                                        "},{"location":"dynamic-allocation/ExecutorMonitor/#creating-instance","title":"Creating Instance","text":"

                                                        ExecutorMonitor takes the following to be created:

                                                        • SparkConf
                                                        • ExecutorAllocationClient
                                                        • LiveListenerBus
                                                        • Clock

                                                          ExecutorMonitor is created\u00a0when:

                                                          • ExecutorAllocationManager is created
                                                          "},{"location":"dynamic-allocation/ExecutorMonitor/#shuffleids-registry","title":"shuffleIds Registry
                                                          shuffleIds: Set[Int]\n

                                                          ExecutorMonitor uses a mutable HashSet to track shuffle IDs...FIXME

                                                          shuffleIds is initialized only when shuffleTrackingEnabled is enabled.

                                                          shuffleIds is used by Tracker internal class for the following:

                                                          • updateTimeout, addShuffle, removeShuffle and updateActiveShuffles
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executors-registry","title":"Executors Registry
                                                          executors: ConcurrentHashMap[String, Tracker]\n

                                                          ExecutorMonitor uses a Java ConcurrentHashMap to track available executors.

                                                          An executor is added when (via ensureExecutorIsTracked):

                                                          • onBlockUpdated
                                                          • onExecutorAdded
                                                          • onTaskStart

                                                          An executor is removed when onExecutorRemoved.

                                                          All executors are removed when reset.

                                                          executors is used when:

                                                          • onOtherEvent (cleanupShuffle)
                                                          • executorCount
                                                          • executorsKilled
                                                          • onUnpersistRDD
                                                          • onTaskEnd
                                                          • onJobStart
                                                          • onJobEnd
                                                          • pendingRemovalCount
                                                          • timedOutExecutors
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#fetchfromshufflesvcenabled-flag","title":"fetchFromShuffleSvcEnabled Flag
                                                          fetchFromShuffleSvcEnabled: Boolean\n

                                                          ExecutorMonitor initializes fetchFromShuffleSvcEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.shuffle.service.fetch.rdd.enabled configuration properties.

fetchFromShuffleSvcEnabled is enabled (true) when both of the aforementioned configuration properties are enabled.

                                                          fetchFromShuffleSvcEnabled is used when:

                                                          • onBlockUpdated
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#shuffletrackingenabled-flag","title":"shuffleTrackingEnabled Flag
                                                          shuffleTrackingEnabled: Boolean\n

                                                          ExecutorMonitor initializes shuffleTrackingEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.dynamicAllocation.shuffleTracking.enabled configuration properties.

shuffleTrackingEnabled is enabled (true) when both of the following hold:

                                                          1. spark.shuffle.service.enabled is disabled
                                                          2. spark.dynamicAllocation.shuffleTracking.enabled is enabled
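
In other words (a sketch with hypothetical local values for the two properties):

// true only without an external shuffle service and with shuffle tracking enabled\nval shuffleTrackingEnabled = !shuffleServiceEnabled && shuffleTrackingConfEnabled\n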

When disabled, shuffleTrackingEnabled is used to skip execution of the following (making them noops):

                                                          • onJobStart
                                                          • onJobEnd

When enabled, shuffleTrackingEnabled is used for the following:

                                                          • onTaskEnd
                                                          • shuffleCleaned
                                                          • shuffleIds
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#sparkdynamicallocationcachedexecutoridletimeout","title":"spark.dynamicAllocation.cachedExecutorIdleTimeout

                                                          ExecutorMonitor reads spark.dynamicAllocation.cachedExecutorIdleTimeout configuration property for Tracker to updateTimeout.

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onblockupdated","title":"onBlockUpdated
                                                          onBlockUpdated(\n  event: SparkListenerBlockUpdated): Unit\n

                                                          onBlockUpdated\u00a0is part of the SparkListenerInterface abstraction.

                                                          onBlockUpdated...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onexecutoradded","title":"onExecutorAdded
                                                          onExecutorAdded(\n  event: SparkListenerExecutorAdded): Unit\n

                                                          onExecutorAdded\u00a0is part of the SparkListenerInterface abstraction.

                                                          onExecutorAdded...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onexecutorremoved","title":"onExecutorRemoved
                                                          onExecutorRemoved(\n  event: SparkListenerExecutorRemoved): Unit\n

                                                          onExecutorRemoved\u00a0is part of the SparkListenerInterface abstraction.

                                                          onExecutorRemoved...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onjobend","title":"onJobEnd
                                                          onJobEnd(\n  event: SparkListenerJobEnd): Unit\n

                                                          onJobEnd\u00a0is part of the SparkListenerInterface abstraction.

                                                          onJobEnd...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onjobstart","title":"onJobStart
                                                          onJobStart(\n  event: SparkListenerJobStart): Unit\n

                                                          onJobStart\u00a0is part of the SparkListenerInterface abstraction.

                                                          Note

                                                          onJobStart does nothing and simply returns when the shuffleTrackingEnabled flag is turned off (false).

                                                          onJobStart requests the input SparkListenerJobStart for the StageInfos and converts...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onotherevent","title":"onOtherEvent
                                                          onOtherEvent(\n  event: SparkListenerEvent): Unit\n

                                                          onOtherEvent\u00a0is part of the SparkListenerInterface abstraction.

                                                          onOtherEvent...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#cleanupshuffle","title":"cleanupShuffle
                                                          cleanupShuffle(\n  id: Int): Unit\n

                                                          cleanupShuffle...FIXME

                                                          cleanupShuffle\u00a0is used when onOtherEvent

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ontaskend","title":"onTaskEnd
                                                          onTaskEnd(\n  event: SparkListenerTaskEnd): Unit\n

                                                          onTaskEnd\u00a0is part of the SparkListenerInterface abstraction.

                                                          onTaskEnd...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ontaskstart","title":"onTaskStart
                                                          onTaskStart(\n  event: SparkListenerTaskStart): Unit\n

                                                          onTaskStart\u00a0is part of the SparkListenerInterface abstraction.

                                                          onTaskStart...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onunpersistrdd","title":"onUnpersistRDD
                                                          onUnpersistRDD(\n  event: SparkListenerUnpersistRDD): Unit\n

                                                          onUnpersistRDD\u00a0is part of the SparkListenerInterface abstraction.

                                                          onUnpersistRDD...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#reset","title":"reset
                                                          reset(): Unit\n

                                                          reset...FIXME

                                                          reset\u00a0is used when:

                                                          • FIXME
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#shufflecleaned","title":"shuffleCleaned
                                                          shuffleCleaned(\n  shuffleId: Int): Unit\n

                                                          shuffleCleaned\u00a0is part of the CleanerListener abstraction.

                                                          shuffleCleaned...FIXME

                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#timedoutexecutors","title":"timedOutExecutors
                                                          timedOutExecutors(): Seq[String]\ntimedOutExecutors(\n  when: Long): Seq[String]\n

                                                          timedOutExecutors...FIXME

                                                          timedOutExecutors\u00a0is used when:

                                                          • ExecutorAllocationManager is requested to schedule
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executorcount","title":"executorCount
                                                          executorCount: Int\n

                                                          executorCount...FIXME

                                                          executorCount\u00a0is used when:

                                                          • ExecutorAllocationManager is requested to addExecutors and removeExecutors
                                                          • ExecutorAllocationManagerSource is requested for numberAllExecutors performance metric
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#pendingremovalcount","title":"pendingRemovalCount
                                                          pendingRemovalCount: Int\n

                                                          pendingRemovalCount...FIXME

                                                          pendingRemovalCount\u00a0is used when:

                                                          • ExecutorAllocationManager is requested to removeExecutors
                                                          • ExecutorAllocationManagerSource is requested for numberExecutorsPendingToRemove performance metric
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executorskilled","title":"executorsKilled
                                                          executorsKilled(\n  ids: Seq[String]): Unit\n

                                                          executorsKilled...FIXME

                                                          executorsKilled\u00a0is used when:

                                                          • ExecutorAllocationManager is requested to removeExecutors
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ensureexecutoristracked","title":"ensureExecutorIsTracked
                                                          ensureExecutorIsTracked(\n  id: String,\n  resourceProfileId: Int): Tracker\n

                                                          ensureExecutorIsTracked...FIXME

                                                          ensureExecutorIsTracked\u00a0is used when:

                                                          • onBlockUpdated
                                                          • onExecutorAdded
                                                          • onTaskStart
                                                          ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#getresourceprofileid","title":"getResourceProfileId
                                                          getResourceProfileId(\n  executorId: String): Int\n

                                                          getResourceProfileId...FIXME

                                                          getResourceProfileId\u00a0is used for testing only.

                                                          ","text":""},{"location":"dynamic-allocation/Tracker/","title":"Tracker","text":"

                                                          Tracker is a private internal class of ExecutorMonitor.

                                                          "},{"location":"dynamic-allocation/Tracker/#creating-instance","title":"Creating Instance","text":"

                                                          Tracker takes the following to be created:

                                                          • resourceProfileId

                                                            Tracker is created\u00a0when:

                                                            • ExecutorMonitor is requested to ensureExecutorIsTracked
                                                            "},{"location":"dynamic-allocation/Tracker/#cachedblocks-internal-registry","title":"cachedBlocks Internal Registry
                                                            cachedBlocks: Map[Int, BitSet]\n

                                                            Tracker uses cachedBlocks internal registry for cached blocks (RDD IDs and partition IDs stored in an executor).

                                                            cachedBlocks is used when:

                                                            • ExecutorMonitor is requested to onBlockUpdated, onUnpersistRDD
                                                            • Tracker is requested to updateTimeout
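
For intuition, recording that partition 3 of RDD 42 is cached on the tracked executor could look like this (a sketch with made-up IDs, assuming java.util.BitSet and a mutable map; not the actual code):

import java.util.BitSet\nimport scala.collection.mutable\n\nval cachedBlocks = mutable.HashMap[Int, BitSet]()\ncachedBlocks.getOrElseUpdate(42, new BitSet()).set(3)\n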
                                                            ","text":""},{"location":"dynamic-allocation/Tracker/#removeshuffle","title":"removeShuffle
                                                            removeShuffle(\n  id: Int): Unit\n

                                                            removeShuffle...FIXME

                                                            removeShuffle\u00a0is used when:

                                                            • ExecutorMonitor is requested to cleanupShuffle
                                                            ","text":""},{"location":"dynamic-allocation/Tracker/#updateactiveshuffles","title":"updateActiveShuffles
                                                            updateActiveShuffles(\n  ids: Iterable[Int]): Unit\n

                                                            updateActiveShuffles...FIXME

                                                            updateActiveShuffles\u00a0is used when:

                                                            • ExecutorMonitor is requested to onJobStart and onJobEnd
                                                            ","text":""},{"location":"dynamic-allocation/Tracker/#updaterunningtasks","title":"updateRunningTasks
                                                            updateRunningTasks(\n  delta: Int): Unit\n

                                                            updateRunningTasks...FIXME

                                                            updateRunningTasks\u00a0is used when:

                                                            • ExecutorMonitor is requested to onTaskStart, onTaskEnd and onExecutorAdded
                                                            ","text":""},{"location":"dynamic-allocation/Tracker/#updatetimeout","title":"updateTimeout
                                                            updateTimeout(): Unit\n

                                                            updateTimeout...FIXME

                                                            updateTimeout\u00a0is used when:

                                                            • ExecutorMonitor is requested to onBlockUpdated and onUnpersistRDD
                                                            • Tracker is requested to updateRunningTasks, removeShuffle, updateActiveShuffles
                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/","title":"Spark Configuration Properties","text":""},{"location":"dynamic-allocation/configuration-properties/#sparkdynamicallocation","title":"spark.dynamicAllocation","text":""},{"location":"dynamic-allocation/configuration-properties/#cachedexecutoridletimeout","title":"cachedExecutorIdleTimeout

                                                            spark.dynamicAllocation.cachedExecutorIdleTimeout

How long (in seconds) an executor with cached blocks can remain idle before it is removed

                                                            Default: The largest value representable as an Int

                                                            Must be >= 0

                                                            Used when:

                                                            • ExecutorMonitor is created
                                                            • RDD is requested to localCheckpoint (simply to print out a WARN message)
                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#enabled","title":"enabled

                                                            spark.dynamicAllocation.enabled

                                                            Enables Dynamic Allocation of Executors

                                                            Default: false

                                                            Used when:

                                                            • BarrierJobAllocationFailed is requested for ERROR_MESSAGE_RUN_BARRIER_WITH_DYN_ALLOCATION (for reporting purposes)
                                                            • RDD is requested to localCheckpoint (for reporting purposes)
                                                            • SparkSubmitArguments is requested to loadEnvironmentArguments (for validation purposes)
                                                            • Utils is requested to isDynamicAllocationEnabled
                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#executorallocationratio","title":"executorAllocationRatio

                                                            spark.dynamicAllocation.executorAllocationRatio

                                                            Default: 1.0

                                                            Must be between 0 (exclusive) and 1.0 (inclusive)

                                                            Used when:

                                                            • ExecutorAllocationManager is created
                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#executoridletimeout","title":"executorIdleTimeout

                                                            spark.dynamicAllocation.executorIdleTimeout

                                                            Default: 60

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#initialexecutors","title":"initialExecutors

                                                            spark.dynamicAllocation.initialExecutors

                                                            Default: spark.dynamicAllocation.minExecutors

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#maxexecutors","title":"maxExecutors

                                                            spark.dynamicAllocation.maxExecutors

                                                            Default: Int.MaxValue

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#minexecutors","title":"minExecutors

                                                            spark.dynamicAllocation.minExecutors

                                                            Default: 0

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#schedulerbacklogtimeout","title":"schedulerBacklogTimeout

                                                            spark.dynamicAllocation.schedulerBacklogTimeout

                                                            (in seconds)

                                                            Default: 1

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#shuffletrackingenabled","title":"shuffleTracking.enabled

                                                            spark.dynamicAllocation.shuffleTracking.enabled

                                                            Default: false

                                                            Used when:

                                                            • ExecutorMonitor is created
                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#shuffletrackingtimeout","title":"shuffleTracking.timeout

                                                            spark.dynamicAllocation.shuffleTracking.timeout

                                                            (in millis)

                                                            Default: The largest value representable as an Int

                                                            ","text":""},{"location":"dynamic-allocation/configuration-properties/#sustainedschedulerbacklogtimeout","title":"sustainedSchedulerBacklogTimeout

                                                            spark.dynamicAllocation.sustainedSchedulerBacklogTimeout

                                                            Default: spark.dynamicAllocation.schedulerBacklogTimeout

                                                            ","text":""},{"location":"executor/","title":"Executor","text":"

                                                            Spark applications start one or more Executors for executing tasks.

                                                            By default (in Static Allocation of Executors) executors run for the entire lifetime of a Spark application (unlike in Dynamic Allocation).

                                                            Executors are managed by ExecutorBackend.

Executors report heartbeats and partial metrics for active tasks to the HeartbeatReceiver RPC Endpoint on the driver.

                                                            Executors provide in-memory storage for RDDs that are cached in Spark applications (via BlockManager).

When started, an executor first registers itself with the driver, establishing a direct communication channel for accepting tasks to execute.

                                                            Executor offers are described by executor id and the host on which an executor runs.

                                                            Executors can run multiple tasks over their lifetime, both in parallel and sequentially, and track running tasks.

                                                            Executors use an Executor task launch worker thread pool for launching tasks.

                                                            Executors send metrics (and heartbeats) using the Heartbeat Sender Thread.
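
The Executor task launch worker thread pool mentioned above is essentially a daemon cached thread pool with descriptive thread names, along these lines (an illustrative sketch assuming Guava's ThreadFactoryBuilder; not the actual code):

import java.util.concurrent.Executors\nimport com.google.common.util.concurrent.ThreadFactoryBuilder\n\nval threadFactory = new ThreadFactoryBuilder()\n  .setDaemon(true)\n  .setNameFormat(\"Executor task launch worker-%d\")\n  .build()\nval threadPool = Executors.newCachedThreadPool(threadFactory)\n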

                                                            "},{"location":"executor/CoarseGrainedExecutorBackend/","title":"CoarseGrainedExecutorBackend","text":"

                                                            CoarseGrainedExecutorBackend is an ExecutorBackend that controls the lifecycle of a single executor.

                                                            CoarseGrainedExecutorBackend is an IsolatedThreadSafeRpcEndpoint that connects to the driver (before accepting messages) and shuts down when the driver disconnects.

                                                            CoarseGrainedExecutorBackend can receive the following messages:

                                                            • DecommissionExecutor
                                                            • KillTask
                                                            • LaunchTask
                                                            • RegisteredExecutor
                                                            • Shutdown
                                                            • StopExecutor
                                                            • UpdateDelegationTokens

                                                            When launched, CoarseGrainedExecutorBackend immediately connects to the parent CoarseGrainedSchedulerBackend (to inform that it is ready to launch tasks).

                                                            CoarseGrainedExecutorBackend registers the Executor RPC endpoint to communicate with the driver (with DriverEndpoint).

                                                            CoarseGrainedExecutorBackend sends regular executor status updates to the driver (to keep the Spark scheduler updated on the number of CPU cores free for task scheduling).

                                                            CoarseGrainedExecutorBackend is started in a resource container (as a standalone application).

                                                            "},{"location":"executor/CoarseGrainedExecutorBackend/#creating-instance","title":"Creating Instance","text":"

                                                            CoarseGrainedExecutorBackend takes the following to be created:

                                                            • RpcEnv
                                                            • Driver URL
                                                            • Executor ID
                                                            • Bind Address (unused)
                                                            • Hostname
                                                            • Number of CPU cores
                                                            • SparkEnv
                                                            • Resources Configuration File
                                                            • ResourceProfile

                                                              Note

                                                              driverUrl, executorId, hostname, cores and userClassPath correspond to CoarseGrainedExecutorBackend standalone application's command-line arguments.

                                                              CoarseGrainedExecutorBackend is created upon launching CoarseGrainedExecutorBackend standalone application.

                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#executor","title":"Executor","text":"

                                                              CoarseGrainedExecutorBackend manages the lifecycle of a single Executor:

                                                              • An Executor is created upon receiving a RegisteredExecutor message
                                                              • Stopped upon receiving a Shutdown message (that happens on a separate CoarseGrainedExecutorBackend-stop-executor thread)

                                                              The Executor is used for the following:

                                                              • decommissionSelf
                                                              • Launching a task (upon receiving a LaunchTask message)
                                                              • Killing a task (upon receiving a KillTask message)
                                                              • Reporting the number of CPU cores used for a given task in statusUpdate
                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#statusUpdate","title":"Reporting Task Status","text":"ExecutorBackend
                                                              statusUpdate(\n  taskId: Long,\n  state: TaskState,\n  data: ByteBuffer): Unit\n

                                                              statusUpdate is part of the ExecutorBackend abstraction.

                                                              statusUpdate...FIXME

                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#onStart","title":"Starting Up","text":"RpcEndpoint
                                                              onStart(): Unit\n

                                                              onStart is part of the RpcEndpoint abstraction.

                                                              With spark.decommission.enabled enabled, onStart...FIXME

                                                              onStart prints out the following INFO message to the logs (with the driverUrl):

                                                              Connecting to driver: [driverUrl]\n

                                                              onStart builds a transport-related configuration for shuffle module.

                                                              onStart parseOrFindResources in the given resourcesFileOpt, if defined, and initializes the _resources internal registry (of ResourceInformations).

                                                              onStart asyncSetupEndpointRefByURI (with the given driverUrl).

                                                              If successful, onStart initializes the driver internal registry.

                                                              onStart makes this CoarseGrainedExecutorBackend available to other Spark services using the executorBackend registry.

                                                              onStart sends a blocking RegisterExecutor message. If successful, onStart sends a RegisteredExecutor (to itself).

                                                              In case of any failure, onStart terminates this CoarseGrainedExecutorBackend with the error code 1 and the following reason (with no notification to the driver):

                                                              Cannot register with driver: [driverUrl]\n
                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#messages","title":"Messages","text":""},{"location":"executor/CoarseGrainedExecutorBackend/#DecommissionExecutor","title":"DecommissionExecutor","text":"

DecommissionExecutor is sent out when CoarseGrainedSchedulerBackend is requested to decommissionExecutors.

When received, CoarseGrainedExecutorBackend decommissions itself (decommissionSelf).

                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#RegisteredExecutor","title":"RegisteredExecutor","text":"

                                                              When received, CoarseGrainedExecutorBackend prints out the following INFO message to the logs:

                                                              Successfully registered with driver\n

                                                              CoarseGrainedExecutorBackend initializes the single managed Executor (with the given executorId, the hostname) and sends a LaunchedExecutor message back to the driver.

RegisteredExecutor is sent out when CoarseGrainedExecutorBackend has finished onStart successfully (and registered with the driver).

                                                              "},{"location":"executor/CoarseGrainedExecutorBackend/#logging","title":"Logging","text":"

                                                              Enable ALL logging level for org.apache.spark.executor.CoarseGrainedExecutorBackend logger to see what happens inside.

                                                              Add the following line to conf/log4j2.properties:

                                                              logger.CoarseGrainedExecutorBackend.name = org.apache.spark.executor.CoarseGrainedExecutorBackend\nlogger.CoarseGrainedExecutorBackend.level = all\n

                                                              Refer to Logging.

                                                              "},{"location":"executor/Executor/","title":"Executor","text":""},{"location":"executor/Executor/#creating-instance","title":"Creating Instance","text":"

                                                              Executor takes the following to be created:

                                                              • Executor ID
                                                              • Host name
                                                              • SparkEnv
                                                              • User-defined jars
                                                              • isLocal flag
                                                              • UncaughtExceptionHandler (default: SparkUncaughtExceptionHandler)
                                                              • Resources (Map[String, ResourceInformation])

Executor is created when:

                                                                • CoarseGrainedExecutorBackend is requested to handle a RegisteredExecutor message (after having registered with the driver)
                                                                • LocalEndpoint is created
                                                                "},{"location":"executor/Executor/#when-created","title":"When Created","text":"

When created, Executor prints out the following INFO message to the logs:

                                                                Starting executor ID [executorId] on host [executorHostname]\n

                                                                (only for non-local modes) Executor sets SparkUncaughtExceptionHandler as the default handler invoked when a thread abruptly terminates due to an uncaught exception.

                                                                (only for non-local modes) Executor requests the BlockManager to initialize (with the Spark application id of the SparkConf).

                                                                (only for non-local modes) Executor requests the MetricsSystem to register the following metric sources:

                                                                • ExecutorSource
                                                                • JVMCPUSource
                                                                • ExecutorMetricsSource
                                                                • ShuffleMetricsSource (of the BlockManager)

                                                                Executor uses SparkEnv to access the MetricsSystem and BlockManager.

Executor creates a task class loader (optionally with REPL support) and requests the system Serializer to use it as the default class loader (for deserializing tasks).

                                                                Executor starts sending heartbeats with the metrics of active tasks.

                                                                "},{"location":"executor/Executor/#plugincontainer","title":"PluginContainer

                                                                Executor creates a PluginContainer (with the SparkEnv and the resources).

                                                                The PluginContainer is used to create a TaskRunner for launching a task.

                                                                The PluginContainer is requested to shutdown in stop.

                                                                ","text":""},{"location":"executor/Executor/#executorsource","title":"ExecutorSource

                                                                When created, Executor creates an ExecutorSource (with the threadPool, the executorId and the schemes).

                                                                The ExecutorSource is then registered with the application's MetricsSystem (in local and non-local modes) to report metrics.

                                                                The metrics are updated right after a TaskRunner has finished executing a task.

                                                                ","text":""},{"location":"executor/Executor/#executormetricssource","title":"ExecutorMetricsSource

                                                                Executor creates an ExecutorMetricsSource when created with the spark.metrics.executorMetricsSource.enabled enabled.

                                                                Executor uses the ExecutorMetricsSource to create the ExecutorMetricsPoller.

                                                                Executor requests the ExecutorMetricsSource to register immediately when created with the isLocal flag disabled.

                                                                ","text":""},{"location":"executor/Executor/#executormetricspoller","title":"ExecutorMetricsPoller

                                                                Executor creates an ExecutorMetricsPoller when created with the following:

                                                                • MemoryManager of the SparkEnv
                                                                • spark.executor.metrics.pollingInterval
                                                                • ExecutorMetricsSource

                                                                Executor requests the ExecutorMetricsPoller to start immediately when created and to stop when requested to stop.

                                                                TaskRunner requests the ExecutorMetricsPoller to onTaskStart and onTaskCompletion at the beginning and the end of run, respectively.

                                                                When requested to reportHeartBeat with pollOnHeartbeat enabled, Executor requests the ExecutorMetricsPoller to poll.

                                                                ","text":""},{"location":"executor/Executor/#fetching-file-and-jar-dependencies","title":"Fetching File and Jar Dependencies
                                                                updateDependencies(\n  newFiles: Map[String, Long],\n  newJars: Map[String, Long]): Unit\n

                                                                updateDependencies fetches missing or outdated extra files (in the given newFiles). For every name-timestamp pair that...FIXME..., updateDependencies prints out the following INFO message to the logs:

                                                                Fetching [name] with timestamp [timestamp]\n

                                                                updateDependencies fetches missing or outdated extra jars (in the given newJars). For every name-timestamp pair that...FIXME..., updateDependencies prints out the following INFO message to the logs:

                                                                Fetching [name] with timestamp [timestamp]\n

                                                                updateDependencies fetches the file to the SparkFiles root directory.

                                                                updateDependencies...FIXME

                                                                updateDependencies is used when:

                                                                • TaskRunner is requested to start (and run a task)
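
The files and jars that updateDependencies fetches are the ones registered on the driver, e.g. with the public SparkContext API. A minimal, illustrative example (the paths are hypothetical):

sc.addFile(\"/tmp/lookup.csv\")  // registered on the driver; executors fetch it into the SparkFiles root directory\nsc.addJar(\"/tmp/extra-lib.jar\")  // registered on the driver; executors add it to the task class loader\nval localPath = org.apache.spark.SparkFiles.get(\"lookup.csv\")  // resolve the local copy (e.g. inside a task)\n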
                                                                ","text":""},{"location":"executor/Executor/#sparkdrivermaxresultsize","title":"spark.driver.maxResultSize

Executor uses spark.driver.maxResultSize when TaskRunner is requested to run a task (and decide on the type of a serialized task result).

                                                                ","text":""},{"location":"executor/Executor/#maximum-size-of-direct-results","title":"Maximum Size of Direct Results

                                                                Executor uses the minimum of spark.task.maxDirectResultSize and spark.rpc.message.maxSize when TaskRunner is requested to run a task (and decide on the type of a serialized task result).
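
Both thresholds are regular configuration properties and can be adjusted at submit time, for example (the values below are arbitrary; spark.rpc.message.maxSize is in MiB):

./bin/spark-shell --conf spark.task.maxDirectResultSize=512k --conf spark.rpc.message.maxSize=256\n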

                                                                ","text":""},{"location":"executor/Executor/#islocal-flag","title":"isLocal Flag

Executor is given the isLocal flag when created to indicate whether the executor runs in local mode (i.e., whether the Spark application uses a local or a cluster-specific master URL).

isLocal is disabled (false) by default and is explicitly disabled when CoarseGrainedExecutorBackend is requested to handle a RegisteredExecutor message.

isLocal is enabled (true) when LocalEndpoint is created.

                                                                ","text":""},{"location":"executor/Executor/#sparkexecutoruserclasspathfirst","title":"spark.executor.userClassPathFirst

                                                                Executor reads the value of the spark.executor.userClassPathFirst configuration property when created.

                                                                When enabled, Executor uses ChildFirstURLClassLoader (not MutableURLClassLoader) when requested to createClassLoader (and addReplClassLoaderIfNeeded).
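
The property can be enabled per application at submit time, for example:

./bin/spark-shell --conf spark.executor.userClassPathFirst=true\n

With the property enabled, classes from the user-defined jars take precedence over the ones shipped with Spark when tasks are executed.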

                                                                ","text":""},{"location":"executor/Executor/#user-defined-jars","title":"User-Defined Jars

                                                                Executor is given user-defined jars when created. No jars are assumed by default.

                                                                The jars are specified using spark.executor.extraClassPath configuration property (via --user-class-path command-line option of CoarseGrainedExecutorBackend).

                                                                ","text":""},{"location":"executor/Executor/#running-tasks-registry","title":"Running Tasks Registry
                                                                runningTasks: Map[Long, TaskRunner]\n

                                                                Executor tracks TaskRunners by task IDs.

                                                                ","text":""},{"location":"executor/Executor/#heartbeatreceiver-rpc-endpoint-reference","title":"HeartbeatReceiver RPC Endpoint Reference

                                                                When created, Executor creates an RPC endpoint reference to HeartbeatReceiver (running on the driver).

                                                                Executor uses the RPC endpoint reference when requested to reportHeartBeat.

                                                                ","text":""},{"location":"executor/Executor/#launching-task","title":"Launching Task
                                                                launchTask(\n  context: ExecutorBackend,\n  taskDescription: TaskDescription): Unit\n

                                                                launchTask creates a TaskRunner (with the given ExecutorBackend, the TaskDescription and the PluginContainer) and adds it to the runningTasks internal registry.

                                                                launchTask requests the \"Executor task launch worker\" thread pool to execute the TaskRunner (sometime in the future).

                                                                In case the decommissioned flag is enabled, launchTask prints out the following ERROR message to the logs:

                                                                Launching a task while in decommissioned state.\n

                                                                launchTask is used when:

                                                                • CoarseGrainedExecutorBackend is requested to handle a LaunchTask message
                                                                • LocalEndpoint RPC endpoint (of LocalSchedulerBackend) is requested to reviveOffers
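
A minimal sketch of the launch path described above, with simplified names (illustrative only, not the exact Spark code):

// register the task and hand it over to the thread pool\nval taskRunner = new TaskRunner(context, taskDescription, plugins)\nrunningTasks.put(taskDescription.taskId, taskRunner)\nthreadPool.execute(taskRunner)  // TaskRunner.run() executes on an \"Executor task launch worker\" thread\n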
                                                                ","text":""},{"location":"executor/Executor/#sending-heartbeats-and-active-tasks-metrics","title":"Sending Heartbeats and Active Tasks Metrics

                                                                Executors keep sending metrics for active tasks to the driver every spark.executor.heartbeatInterval (defaults to 10s with some random initial delay so the heartbeats from different executors do not pile up on the driver).

                                                                An executor sends heartbeats using the Heartbeat Sender Thread.

For each TaskRunner (in the runningTasks internal registry), the task's metrics are computed and become part of the heartbeat (with accumulators).

A blocking Heartbeat message that holds the executor id, all accumulator updates (per task id), and the BlockManagerId is sent to the HeartbeatReceiver RPC endpoint.

                                                                If the response requests to re-register BlockManager, Executor prints out the following INFO message to the logs:

                                                                Told to re-register on heartbeat\n

                                                                BlockManager is requested to reregister.

                                                                The internal heartbeatFailures counter is reset.

                                                                If there are any issues with communicating with the driver, Executor prints out the following WARN message to the logs:

                                                                Issue communicating with driver in heartbeater\n

The internal heartbeatFailures counter is incremented and checked against spark.executor.heartbeat.maxFailures. If the maximum number of failures is reached, the following ERROR message is printed out to the logs:

                                                                Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times\n

                                                                The executor exits (using System.exit and exit code 56).
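
Both the heartbeat interval and the number of tolerated failures are configuration properties, for example (the values are arbitrary):

./bin/spark-shell --conf spark.executor.heartbeatInterval=20s --conf spark.executor.heartbeat.maxFailures=30\n

Note that spark.executor.heartbeatInterval should be significantly less than spark.network.timeout.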

                                                                ","text":""},{"location":"executor/Executor/#heartbeat-sender-thread","title":"Heartbeat Sender Thread

                                                                heartbeater is a ScheduledThreadPoolExecutor (Java) with a single thread.

                                                                The name of the thread pool is driver-heartbeater.

                                                                ","text":""},{"location":"executor/Executor/#executor-task-launch-worker-thread-pool","title":"Executor task launch worker Thread Pool

When created, Executor creates threadPool, a daemon cached thread pool with the thread name format Executor task launch worker-[ID] (with ID being a counter assigned by the thread factory).

                                                                The threadPool thread pool is used for launching tasks.

                                                                ","text":""},{"location":"executor/Executor/#executor-memory","title":"Executor Memory

                                                                The amount of memory per executor is configured using spark.executor.memory configuration property. It sets the available memory equally for all executors per application.

                                                                You can find the value displayed as Memory per Node in the web UI of the standalone Master.
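
For example, the following are two equivalent ways to give every executor 2 GB of memory (the value is arbitrary):

./bin/spark-shell --executor-memory 2g\n
./bin/spark-shell --conf spark.executor.memory=2g\n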

                                                                ","text":""},{"location":"executor/Executor/#heartbeating-with-partial-metrics-for-active-tasks-to-driver","title":"Heartbeating With Partial Metrics For Active Tasks To Driver
                                                                reportHeartBeat(): Unit\n

                                                                reportHeartBeat collects TaskRunners for currently running tasks (active tasks) with their tasks deserialized (i.e. either ready for execution or already started).

A TaskRunner has its task deserialized when it runs the task.

                                                                For every running task, reportHeartBeat takes the TaskMetrics and:

                                                                • Requests ShuffleRead metrics to be merged
                                                                • Sets jvmGCTime metrics

                                                                reportHeartBeat then records the latest values of internal and external accumulators for every task.

                                                                Note

                                                                Internal accumulators are a task's metrics while external accumulators are a Spark application's accumulators that a user has created.

                                                                reportHeartBeat sends a blocking Heartbeat message to the HeartbeatReceiver (on the driver). reportHeartBeat uses the value of spark.executor.heartbeatInterval configuration property for the RPC timeout.

                                                                Note

                                                                A Heartbeat message contains the executor identifier, the accumulator updates, and the identifier of the BlockManager.

                                                                If the response (from HeartbeatReceiver) is to re-register the BlockManager, reportHeartBeat prints out the following INFO message to the logs and requests the BlockManager to re-register (which will register the blocks the BlockManager manages with the driver).

                                                                Told to re-register on heartbeat\n

HeartbeatResponse requests the BlockManager to re-register when either the TaskScheduler or the HeartbeatReceiver knows nothing about the executor.

                                                                When posting the Heartbeat was successful, reportHeartBeat resets heartbeatFailures internal counter.

                                                                In case of a non-fatal exception, you should see the following WARN message in the logs (followed by the stack trace).

                                                                Issue communicating with driver in heartbeater\n

On every failure, reportHeartBeat increments the heartbeat failures counter, up to the spark.executor.heartbeat.maxFailures configuration property. When the number of heartbeat failures reaches the maximum, reportHeartBeat prints out the following ERROR message to the logs and the executor terminates with error code 56.

                                                                Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times\n

                                                                reportHeartBeat is used when:

                                                                • Executor is requested to schedule reporting heartbeat and partial metrics for active tasks to the driver (that happens every spark.executor.heartbeatInterval).
                                                                ","text":""},{"location":"executor/Executor/#sparkexecutorheartbeatmaxfailures","title":"spark.executor.heartbeat.maxFailures

                                                                Executor uses spark.executor.heartbeat.maxFailures configuration property in reportHeartBeat.

                                                                ","text":""},{"location":"executor/Executor/#logging","title":"Logging

                                                                Enable ALL logging level for org.apache.spark.executor.Executor logger to see what happens inside.

                                                                Add the following line to conf/log4j.properties:

                                                                log4j.logger.org.apache.spark.executor.Executor=ALL\n

                                                                Refer to Logging.

                                                                ","text":""},{"location":"executor/ExecutorBackend/","title":"ExecutorBackend","text":"

                                                                ExecutorBackend is an abstraction of executor backends (that TaskRunners use to report task status updates to a scheduler).

                                                                ExecutorBackend acts as a bridge between executors and the driver.

                                                                "},{"location":"executor/ExecutorBackend/#contract","title":"Contract","text":""},{"location":"executor/ExecutorBackend/#statusUpdate","title":"Reporting Task Status","text":"
                                                                statusUpdate(\n  taskId: Long,\n  state: TaskState,\n  data: ByteBuffer): Unit\n

Reports the status of the given task to a scheduler.

                                                                See:

                                                                • CoarseGrainedExecutorBackend

                                                                Used when:

                                                                • TaskRunner is requested to run a task
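
A hedged sketch of the contract (illustrative only: ExecutorBackend is a private[spark] trait, so a real implementation lives inside Spark itself and typically forwards the update to the driver or a local scheduler):

import java.nio.ByteBuffer\nimport org.apache.spark.TaskState.TaskState\n\nclass LoggingExecutorBackend extends ExecutorBackend {\n  // record every status update a TaskRunner reports\n  override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit =\n    println(s\"task $taskId is now $state (${data.remaining()} bytes of serialized result)\")\n}\n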
                                                                "},{"location":"executor/ExecutorBackend/#implementations","title":"Implementations","text":"
                                                                • CoarseGrainedExecutorBackend
                                                                • LocalSchedulerBackend
                                                                • MesosExecutorBackend
                                                                "},{"location":"executor/ExecutorLogUrlHandler/","title":"ExecutorLogUrlHandler","text":""},{"location":"executor/ExecutorLogUrlHandler/#creating-instance","title":"Creating Instance","text":"

                                                                ExecutorLogUrlHandler takes the following to be created:

                                                                • Optional Log URL Pattern

ExecutorLogUrlHandler is created for the following:

                                                                  • DriverEndpoint
                                                                  • HistoryAppStatusStore
                                                                  "},{"location":"executor/ExecutorLogUrlHandler/#applying-pattern","title":"Applying Pattern
                                                                  applyPattern(\n  logUrls: Map[String, String],\n  attributes: Map[String, String]): Map[String, String]\n

applyPattern doApplyPattern when the logUrlPattern is defined, or simply returns the given logUrls unchanged (see the example below).

applyPattern is used when:

• DriverEndpoint is requested to handle a RegisterExecutor message (and creates an ExecutorData)
                                                                  • HistoryAppStatusStore is requested to replaceLogUrls
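
The log URL pattern typically comes from configuration properties such as spark.ui.custom.executor.log.url or spark.history.custom.executor.log.url (an assumption based on Spark's custom executor log URL support), with attributes substituted into {{...}} placeholders, for example (the host and attribute names are illustrative):

./bin/spark-shell --conf spark.ui.custom.executor.log.url='http://logserver/{{APP_ID}}/{{EXECUTOR_ID}}/{{FILE_NAME}}'\n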
                                                                  ","text":""},{"location":"executor/ExecutorLogUrlHandler/#doapplypattern","title":"doApplyPattern
                                                                  doApplyPattern(\n  logUrls: Map[String, String],\n  attributes: Map[String, String],\n  urlPattern: String): Map[String, String]\n

                                                                  doApplyPattern...FIXME

                                                                  ","text":""},{"location":"executor/ExecutorMetricType/","title":"ExecutorMetricType","text":"

                                                                  ExecutorMetricType is an abstraction of executor metric types.

                                                                  "},{"location":"executor/ExecutorMetricType/#contract","title":"Contract","text":""},{"location":"executor/ExecutorMetricType/#metric-values","title":"Metric Values
                                                                  getMetricValues(\n  memoryManager: MemoryManager): Array[Long]\n

                                                                  Used when:

                                                                  • ExecutorMetrics utility is used for the current metric values
                                                                  ","text":""},{"location":"executor/ExecutorMetricType/#metric-names","title":"Metric Names
                                                                  names: Seq[String]\n

                                                                  Used when:

                                                                  • ExecutorMetricType utility is used for the metricToOffset and number of metrics
                                                                  ","text":""},{"location":"executor/ExecutorMetricType/#implementations","title":"Implementations","text":"Sealed Trait

                                                                  ExecutorMetricType is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                                                  Learn more in the Scala Language Specification.

                                                                  • GarbageCollectionMetrics
                                                                  • ProcessTreeMetrics
                                                                  • SingleValueExecutorMetricType
                                                                  • JVMHeapMemory
                                                                  • JVMOffHeapMemory
                                                                  • MBeanExecutorMetricType
                                                                  • DirectPoolMemory
                                                                  • MappedPoolMemory
                                                                  • MemoryManagerExecutorMetricType
                                                                  • OffHeapExecutionMemory
                                                                  • OffHeapStorageMemory
                                                                  • OffHeapUnifiedMemory
                                                                  • OnHeapExecutionMemory
                                                                  • OnHeapStorageMemory
                                                                  • OnHeapUnifiedMemory
                                                                  "},{"location":"executor/ExecutorMetricType/#executor-metric-getters-ordered-executormetrictypes","title":"Executor Metric Getters (Ordered ExecutorMetricTypes)

                                                                  ExecutorMetricType defines an ordered collection of ExecutorMetricTypes:

                                                                  1. JVMHeapMemory
                                                                  2. JVMOffHeapMemory
                                                                  3. OnHeapExecutionMemory
                                                                  4. OffHeapExecutionMemory
                                                                  5. OnHeapStorageMemory
                                                                  6. OffHeapStorageMemory
                                                                  7. OnHeapUnifiedMemory
                                                                  8. OffHeapUnifiedMemory
                                                                  9. DirectPoolMemory
                                                                  10. MappedPoolMemory
                                                                  11. ProcessTreeMetrics
                                                                  12. GarbageCollectionMetrics

This ordering allows passing metric values around as arrays (to save space), with every index corresponding to a single metric of a metric type.
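
A simplified illustration of the array-plus-offset idea (not the actual Spark code; memoryManager stands for the executor's MemoryManager): the getters are polled in order, their values are concatenated into a single array, and a name-to-offset map is derived once from that same order.

// illustrative only\nval values: Array[Long] = metricGetters.flatMap(_.getMetricValues(memoryManager)).toArray\nval metricToOffset: Map[String, Int] = metricGetters.flatMap(_.names).zipWithIndex.toMap\nval jvmHeapUsed = values(metricToOffset(\"JVMHeapMemory\"))  // look a single metric up by its offset\n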

                                                                  metricGetters is used when:

                                                                  • ExecutorMetrics utility is used for the current metric values
                                                                  • ExecutorMetricType utility is used to get the metricToOffset and the numMetrics
                                                                  ","text":""},{"location":"executor/ExecutorMetrics/","title":"ExecutorMetrics","text":"

                                                                  ExecutorMetrics is a collection of executor metrics.

                                                                  ","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                  ExecutorMetrics takes the following to be created:

                                                                  • Metrics

                                                                    ExecutorMetrics is created when:

                                                                    • SparkContext is requested to reportHeartBeat
                                                                    • DAGScheduler is requested to post a SparkListenerTaskEnd event
                                                                    • ExecutorMetricsPoller is requested to getExecutorUpdates
                                                                    • ExecutorMetricsJsonDeserializer is requested to deserialize
                                                                    • JsonProtocol is requested to executorMetricsFromJson
                                                                    ","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetrics/#current-metric-values","title":"Current Metric Values
                                                                    getCurrentMetrics(\n  memoryManager: MemoryManager): Array[Long]\n

                                                                    getCurrentMetrics gives metric values for every metric getter.

Given that one metric getter (type) can report multiple metrics, the length of the result collection is the number of metrics (and at least the number of metric getters). The order matters and is exactly that of metricGetters.

                                                                    getCurrentMetrics is used when:

                                                                    • SparkContext is requested to reportHeartBeat
                                                                    • ExecutorMetricsPoller is requested to poll
                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetricsPoller/","title":"ExecutorMetricsPoller","text":""},{"location":"executor/ExecutorMetricsPoller/#creating-instance","title":"Creating Instance","text":"

                                                                    ExecutorMetricsPoller takes the following to be created:

                                                                    • MemoryManager
                                                                    • spark.executor.metrics.pollingInterval
                                                                    • ExecutorMetricsSource

                                                                      ExecutorMetricsPoller is created when:

                                                                      • Executor is created
                                                                      "},{"location":"executor/ExecutorMetricsPoller/#executor-metrics-poller","title":"executor-metrics-poller

                                                                      ExecutorMetricsPoller creates a ScheduledExecutorService (Java) when created with the spark.executor.metrics.pollingInterval greater than 0.

The ScheduledExecutorService manages a single daemon thread with the executor-metrics-poller name prefix.

The ScheduledExecutorService is requested to schedule poll at every pollingInterval when ExecutorMetricsPoller is requested to start, until requested to stop.
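
A minimal sketch of the scheduling pattern described above (simplified; poll() and pollingIntervalMs stand for the poller's method and interval, and Spark additionally uses a daemon thread factory with the executor-metrics-poller name prefix):

import java.util.concurrent.{Executors, TimeUnit}\n\nval scheduler = Executors.newSingleThreadScheduledExecutor()\n// invoke poll() at a fixed rate until the service is shut down (in stop)\nscheduler.scheduleAtFixedRate(() => poll(), 0L, pollingIntervalMs, TimeUnit.MILLISECONDS)\n// ...later, on stop:\nscheduler.shutdown()\n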

                                                                      ","text":""},{"location":"executor/ExecutorMetricsPoller/#poll","title":"poll
                                                                      poll(): Unit\n

                                                                      poll...FIXME

                                                                      poll is used when:

                                                                      • Executor is requested to reportHeartBeat
                                                                      • ExecutorMetricsPoller is requested to start
                                                                      ","text":""},{"location":"executor/ExecutorMetricsSource/","title":"ExecutorMetricsSource","text":"

                                                                      ExecutorMetricsSource is a metrics source.

                                                                      "},{"location":"executor/ExecutorMetricsSource/#creating-instance","title":"Creating Instance","text":"

                                                                      ExecutorMetricsSource takes no arguments to be created.

                                                                      ExecutorMetricsSource is created when:

                                                                      • SparkContext is created (with spark.metrics.executorMetricsSource.enabled enabled)
                                                                      • Executor is created (with spark.metrics.executorMetricsSource.enabled enabled)
                                                                      "},{"location":"executor/ExecutorMetricsSource/#source-name","title":"Source Name
                                                                      sourceName: String\n

                                                                      sourceName is ExecutorMetrics.

                                                                      sourceName is part of the Source abstraction.

                                                                      ","text":""},{"location":"executor/ExecutorMetricsSource/#registering-with-metricssystem","title":"Registering with MetricsSystem
                                                                      register(\n  metricsSystem: MetricsSystem): Unit\n

                                                                      register creates ExecutorMetricGauges for every executor metric.

                                                                      register requests the MetricRegistry to register every metric type.

In the end, register requests the given MetricsSystem to register this ExecutorMetricsSource.
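
A hedged sketch of the gauge-registration pattern (Dropwizard Metrics), assuming a metricsSnapshot array and the metricToOffset map described on this page:

import com.codahale.metrics.{Gauge, MetricRegistry}\n\nval metricRegistry = new MetricRegistry()\n// one gauge per executor metric, each reading its slot of the latest metrics snapshot\nExecutorMetricType.metricToOffset.foreach { case (name, offset) =>\n  metricRegistry.register(MetricRegistry.name(name), new Gauge[Long] {\n    override def getValue: Long = metricsSnapshot(offset)\n  })\n}\n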

                                                                      register is used when:

                                                                      • SparkContext is created
                                                                      • Executor is created (for non-local mode)
                                                                      ","text":""},{"location":"executor/ExecutorMetricsSource/#metrics-snapshot","title":"Metrics Snapshot

                                                                      ExecutorMetricsSource defines metricsSnapshot internal registry of values of every metric.

                                                                      The values are updated in updateMetricsSnapshot and read using ExecutorMetricGauges.

                                                                      ","text":""},{"location":"executor/ExecutorMetricsSource/#updatemetricssnapshot","title":"updateMetricsSnapshot
                                                                      updateMetricsSnapshot(\n  metricsUpdates: Array[Long]): Unit\n

                                                                      updateMetricsSnapshot updates the metricsSnapshot registry with the given metricsUpdates.

                                                                      updateMetricsSnapshot is used when:

                                                                      • SparkContext is requested to reportHeartBeat
                                                                      • ExecutorMetricsPoller is requested to poll
                                                                      ","text":""},{"location":"executor/ExecutorSource/","title":"ExecutorSource","text":"

ExecutorSource is a metrics Source of an Executor.

                                                                      "},{"location":"executor/ExecutorSource/#creating-instance","title":"Creating Instance","text":"

                                                                      ExecutorSource takes the following to be created:

                                                                      • ThreadPoolExecutor
                                                                      • Executor ID (unused)
                                                                      • File System Schemes (to report based on spark.executor.metrics.fileSystemSchemes)

ExecutorSource is created when:

                                                                        • Executor is created
                                                                        "},{"location":"executor/ExecutorSource/#name","title":"Name

                                                                        ExecutorSource is known under the name executor.

                                                                        ","text":""},{"location":"executor/ExecutorSource/#metrics","title":"Metrics
                                                                        metricRegistry: MetricRegistry\n

                                                                        metricRegistry is part of the Source abstraction.

The metrics include (among others):

• threadpool.activeTasks: approximate number of threads that are actively executing tasks (based on ThreadPoolExecutor.getActiveCount)","text":""},{"location":"executor/ShuffleReadMetrics/","title":"ShuffleReadMetrics","text":"

                                                                        ShuffleReadMetrics is a collection of metrics (accumulators) on reading shuffle data.

                                                                        ","tags":["DeveloperApi"]},{"location":"executor/ShuffleReadMetrics/#taskmetrics","title":"TaskMetrics

                                                                        ShuffleReadMetrics is available using TaskMetrics.shuffleReadMetrics.

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleReadMetrics/#serializable","title":"Serializable

                                                                        ShuffleReadMetrics is a Serializable (Java).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/","title":"ShuffleWriteMetrics","text":"

                                                                        ShuffleWriteMetrics is a ShuffleWriteMetricsReporter of metrics (accumulators) related to writing shuffle data (in shuffle map tasks):

                                                                        • Shuffle Bytes Written
                                                                        • Shuffle Write Time
                                                                        • Shuffle Records Written
                                                                        ","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                        ShuffleWriteMetrics takes no input arguments to be created.

ShuffleWriteMetrics is created when:

                                                                        • TaskMetrics is created
                                                                        • ShuffleExternalSorter is requested to writeSortedFile
                                                                        • MapIterator (of BytesToBytesMap) is requested to spill
                                                                        • ExternalAppendOnlyMap is created
                                                                        • ExternalSorter is requested to spillMemoryIteratorToDisk
                                                                        • UnsafeExternalSorter is requested to spill
                                                                        • SpillableIterator (of UnsafeExternalSorter) is requested to spill
                                                                        ","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#taskmetrics","title":"TaskMetrics

                                                                        ShuffleWriteMetrics is available using TaskMetrics.shuffleWriteMetrics.

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#serializable","title":"Serializable

                                                                        ShuffleWriteMetrics is a Serializable (Java).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/","title":"TaskMetrics","text":"

                                                                        TaskMetrics is a collection of metrics (accumulators) tracked during execution of a task.

                                                                        ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                        TaskMetrics takes no input arguments to be created.

TaskMetrics is created when:

                                                                        • Stage is requested to makeNewStageAttempt
                                                                        ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#metrics","title":"Metrics","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#shufflewritemetrics","title":"ShuffleWriteMetrics

                                                                        ShuffleWriteMetrics

                                                                        • shuffle.write.bytesWritten
                                                                        • shuffle.write.recordsWritten
                                                                        • shuffle.write.writeTime

ShuffleWriteMetrics is exposed via the Dropwizard metrics system using ExecutorSource (when TaskRunner is about to finish running):

                                                                        • shuffleBytesWritten
                                                                        • shuffleRecordsWritten
                                                                        • shuffleWriteTime

                                                                        ShuffleWriteMetrics can be monitored using:

                                                                        • StatsReportListener (when a stage completes)
                                                                          • shuffle bytes written
                                                                        • JsonProtocol (when requested to taskMetricsToJson)
                                                                          • Shuffle Bytes Written
                                                                          • Shuffle Write Time
                                                                          • Shuffle Records Written

                                                                        shuffleWriteMetrics is used when:

                                                                        • ShuffleWriteProcessor is requested for a ShuffleWriteMetricsReporter
                                                                        • SortShuffleWriter is created
                                                                        • AppStatusListener is requested to handle a SparkListenerTaskEnd
                                                                        • LiveTask is requested to updateMetrics
                                                                        • ExternalSorter is requested to writePartitionedFile (to create a DiskBlockObjectWriter), writePartitionedMapOutput
                                                                        • ShuffleExchangeExec (Spark SQL) is requested for a ShuffleWriteProcessor (to create a ShuffleDependency)
                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#memory-bytes-spilled","title":"Memory Bytes Spilled

                                                                        Number of in-memory bytes spilled by the tasks (of a stage)

                                                                        _memoryBytesSpilled is a LongAccumulator with internal.metrics.memoryBytesSpilled name.

                                                                        memoryBytesSpilled metric is exposed using ExecutorSource as memoryBytesSpilled (using Dropwizard metrics system).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#memorybytesspilled","title":"memoryBytesSpilled","text":"
                                                                        memoryBytesSpilled: Long\n

                                                                        memoryBytesSpilled is the sum of all memory bytes spilled across all tasks.

                                                                        memoryBytesSpilled is used when:

                                                                        • SpillListener is requested to onStageCompleted
                                                                        • TaskRunner is requested to run (and updates task metrics in the Dropwizard metrics system)
                                                                        • LiveTask is requested to updateMetrics
                                                                        • JsonProtocol is requested to taskMetricsToJson
                                                                        ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#incmemorybytesspilled","title":"incMemoryBytesSpilled","text":"
                                                                        incMemoryBytesSpilled(\n  v: Long): Unit\n

                                                                        incMemoryBytesSpilled adds the v value to the _memoryBytesSpilled metric.

                                                                        incMemoryBytesSpilled is used when:

                                                                        • Aggregator is requested to updateMetrics
                                                                        • BasePythonRunner.ReaderIterator is requested to handleTimingData
                                                                        • CoGroupedRDD is requested to compute a partition
                                                                        • ShuffleExternalSorter is requested to spill
                                                                        • JsonProtocol is requested to taskMetricsFromJson
                                                                        • ExternalSorter is requested to insertAllAndUpdateMetrics, writePartitionedFile, writePartitionedMapOutput
                                                                        • UnsafeExternalSorter is requested to createWithExistingInMemorySorter, spill
                                                                        • UnsafeExternalSorter.SpillableIterator is requested to spill
                                                                        ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#taskcontext","title":"TaskContext

                                                                        TaskMetrics is available using TaskContext.taskMetrics.

                                                                        TaskContext.get.taskMetrics\n
                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#serializable","title":"Serializable

                                                                        TaskMetrics is a Serializable (Java).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#task","title":"Task

                                                                        TaskMetrics is part of Task.

                                                                        task.metrics\n
                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#sparklistener","title":"SparkListener

                                                                        TaskMetrics is available using SparkListener and intercepting SparkListenerTaskEnd events.
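
For example, a SparkListener can read the TaskMetrics of every completed task (a minimal sketch):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}\n\nsc.addSparkListener(new SparkListener {\n  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {\n    val metrics = taskEnd.taskMetrics  // may be null for tasks that failed before producing metrics\n    if (metrics != null) {\n      println(s\"task run time: ${metrics.executorRunTime} ms, GC time: ${metrics.jvmGCTime} ms\")\n    }\n  }\n})\n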

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#statsreportlistener","title":"StatsReportListener

                                                                        StatsReportListener can be used for summary statistics at runtime (after a stage completes).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#spark-history-server","title":"Spark History Server

                                                                        Spark History Server uses EventLoggingListener to intercept post-execution statistics (incl. TaskMetrics).

                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskRunner/","title":"TaskRunner","text":"

                                                                        TaskRunner is a thread of execution to run a task.

                                                                        Internal Class

                                                                        TaskRunner is an internal class of Executor with full access to internal registries.

TaskRunner is a java.lang.Runnable so, once a TaskRunner has completed execution, it must not be restarted.

                                                                        "},{"location":"executor/TaskRunner/#creating-instance","title":"Creating Instance","text":"

                                                                        TaskRunner takes the following to be created:

                                                                        • ExecutorBackend (that manages the parent Executor)
                                                                        • TaskDescription
                                                                        • PluginContainer
TaskRunner is created when:

                                                                          • Executor is requested to launch a task
                                                                          "},{"location":"executor/TaskRunner/#plugincontainer","title":"PluginContainer

                                                                          TaskRunner may be given a PluginContainer when created.

                                                                          The PluginContainer is used when TaskRunner is requested to run (for the Task to run).

                                                                          ","text":""},{"location":"executor/TaskRunner/#demo","title":"Demo
                                                                          ./bin/spark-shell --conf spark.driver.maxResultSize=1m\n
                                                                          scala> println(sc.version)\n3.0.1\n
                                                                          val maxResultSize = sc.getConf.get(\"spark.driver.maxResultSize\")\nassert(maxResultSize == \"1m\")\n
                                                                          val rddOver1m = sc.range(0, 1024 * 1024 + 10, 1)\n
                                                                          scala> rddOver1m.collect\nERROR TaskSetManager: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)\nERROR TaskSetManager: Total size of serialized results of 3 tasks (1546.2 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)\nERROR TaskSetManager: Total size of serialized results of 4 tasks (2.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)\nWARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)\nWARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)\nWARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)\nERROR TaskSetManager: Total size of serialized results of 5 tasks (2.5 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)\nWARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)\n...\norg.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)\n  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)\n  ...\n
                                                                          ","text":""},{"location":"executor/TaskRunner/#thread-name","title":"Thread Name

                                                                          TaskRunner uses the following thread name (with the taskId of the TaskDescription):

                                                                          Executor task launch worker for task [taskId]\n
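That makes the task threads easy to spot in a thread dump, e.g. (assuming you know the executor's JVM PID):

jstack <executor-pid> | grep \"Executor task launch worker\"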
                                                                          ","text":""},{"location":"executor/TaskRunner/#running-task","title":"Running Task
                                                                          run(): Unit\n

                                                                          run is part of the java.lang.Runnable abstraction.

                                                                          ","text":""},{"location":"executor/TaskRunner/#initialization","title":"Initialization

                                                                          run initializes the threadId internal registry as the current thread identifier (using Thread.getId).

                                                                          run sets the name of the current thread of execution as the threadName.

                                                                          run creates a TaskMemoryManager (for the current MemoryManager and taskId). run uses SparkEnv to access the current MemoryManager.

                                                                          run starts tracking the time to deserialize a task and sets the current thread's context classloader.

                                                                          run creates a closure Serializer. run uses SparkEnv to access the closure Serializer.

                                                                          run prints out the following INFO message to the logs (with the taskName and taskId):

                                                                          Running [taskName] (TID [taskId])\n

                                                                          run notifies the ExecutorBackend that the status of the task has changed to RUNNING (for the taskId).

                                                                          run computes the total amount of time this JVM process has spent in garbage collection.

                                                                          run uses the addedFiles and addedJars (of the given TaskDescription) to update dependencies.

                                                                          run takes the serializedTask of the given TaskDescription and requests the closure Serializer to deserialize the task. run sets the task internal reference to hold the deserialized task.

                                                                          For non-local environments, run prints out the following DEBUG message to the logs before requesting the MapOutputTrackerWorker to update the epoch (using the epoch of the Task to be executed). run uses SparkEnv to access the MapOutputTrackerWorker.

                                                                          Task [taskId]'s epoch is [epoch]\n

                                                                          run requests the metricsPoller...FIXME

                                                                          run records the current time as the task's start time (taskStartTimeNs).

                                                                          run requests the Task to run (with taskAttemptId as taskId, attemptNumber from TaskDescription, and metricsSystem as the current MetricsSystem).

                                                                          Note

                                                                          run uses SparkEnv to access the MetricsSystem.

                                                                          Note

                                                                          The task runs inside a \"monitored\" block (try-finally block) to detect any memory and lock leaks after the task's run finishes regardless of the final outcome - the computed value or an exception thrown.
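A simplified sketch of that monitored-block idea (the names are made up; this is not the actual Executor code):

// run the task body and always run the leak checks, whether the body
// returns a value or throws an exception
def runMonitored[T](body: => T)(checkForLeaks: () => Unit): T =
  try {
    body
  } finally {
    checkForLeaks()
  }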

                                                                          run creates a Serializer and requests it to serialize the task result (valueBytes).

                                                                          Note

                                                                          run uses SparkEnv to access the Serializer.

                                                                          run updates the metrics of the Task executed.

                                                                          run updates the metric counters in the ExecutorSource.

                                                                          run requests the Task executed for accumulator updates and the ExecutorMetricsPoller for metric peaks.

                                                                          ","text":""},{"location":"executor/TaskRunner/#serialized-task-result","title":"Serialized Task Result

                                                                          run creates a DirectTaskResult (with the serialized task result, the accumulator updates and the metric peaks) and requests the closure Serializer to serialize it.

                                                                          Note

                                                                          The serialized DirectTaskResult is a java.nio.ByteBuffer.

                                                                          run selects between the DirectTaskResult and an IndirectTaskResult based on the size of the serialized task result (limit of this serializedDirectResult byte buffer):

                                                                          1. With the size above spark.driver.maxResultSize, run prints out the following WARN message to the logs and serializes an IndirectTaskResult with a TaskResultBlockId.

                                                                            Finished [taskName] (TID [taskId]). Result is larger than maxResultSize ([resultSize] > [maxResultSize]), dropping it.\n
2. With the size above maxDirectResultSize, run creates a TaskResultBlockId and requests the BlockManager to store the task result locally (with MEMORY_AND_DISK_SER). run prints out the following INFO message to the logs and serializes an IndirectTaskResult with a TaskResultBlockId.

                                                                            Finished [taskName] (TID [taskId]). [resultSize] bytes result sent via BlockManager)\n
                                                                          3. run prints out the following INFO message to the logs and uses the DirectTaskResult created earlier.

                                                                            Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to driver\n

                                                                          Note

serializedResult is either an IndirectTaskResult (possibly with the block stored in BlockManager) or a DirectTaskResult.
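A simplified sketch of the selection logic described above (names and thresholds mirror the description; this is not the actual Executor code):

// resultSize: size (in bytes) of the serialized DirectTaskResult
def chooseSerializedResult(
    resultSize: Long,
    maxResultSize: Long,       // spark.driver.maxResultSize
    maxDirectResultSize: Long): String =
  if (maxResultSize > 0 && resultSize > maxResultSize)
    \"IndirectTaskResult (result dropped)\"                 // case 1
  else if (resultSize > maxDirectResultSize)
    \"IndirectTaskResult (result stored in BlockManager)\"  // case 2
  else
    \"DirectTaskResult (sent directly to the driver)\"      // case 3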

                                                                          ","text":""},{"location":"executor/TaskRunner/#incrementing-succeededtasks-counter","title":"Incrementing succeededTasks Counter

                                                                          run requests the ExecutorSource to increment succeededTasks counter.

                                                                          ","text":""},{"location":"executor/TaskRunner/#marking-task-finished","title":"Marking Task Finished

run calls setTaskFinishedAndClearInterruptStatus.

                                                                          ","text":""},{"location":"executor/TaskRunner/#notifying-executorbackend-that-task-finished","title":"Notifying ExecutorBackend that Task Finished

                                                                          run notifies the ExecutorBackend that the status of the taskId has changed to FINISHED.

                                                                          Note

                                                                          ExecutorBackend is given when the TaskRunner is created.

                                                                          ","text":""},{"location":"executor/TaskRunner/#wrapping-up","title":"Wrapping Up

                                                                          In the end, regardless of the task's execution status (successful or failed), run removes the taskId from runningTasks registry.

                                                                          In case a onTaskStart notification was sent out, run requests the ExecutorMetricsPoller to onTaskCompletion.

                                                                          ","text":""},{"location":"executor/TaskRunner/#exceptions","title":"Exceptions

                                                                          run handles certain exceptions.

| Exception Type | TaskState | Serialized ByteBuffer |
|---|---|---|
| FetchFailedException | FAILED | TaskFailedReason |
| TaskKilledException | KILLED | TaskKilled |
| InterruptedException | KILLED | TaskKilled |
| CommitDeniedException | FAILED | TaskFailedReason |
| Throwable | FAILED | ExceptionFailure |

","text":""},{"location":"executor/TaskRunner/#fetchfailedexception","title":"FetchFailedException

When FetchFailedException is reported while running a task, run calls setTaskFinishedAndClearInterruptStatus.

run requests the FetchFailedException for the TaskFailedReason, serializes it and notifies ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and the serialized reason).

Note

ExecutorBackend was specified when the TaskRunner was created.

Note

run uses a closure Serializer to serialize the failure reason. The Serializer was created before run ran the task.

                                                                          ","text":""},{"location":"executor/TaskRunner/#taskkilledexception","title":"TaskKilledException

                                                                          When TaskKilledException is reported while running a task, you should see the following INFO message in the logs:

                                                                          Executor killed [taskName] (TID [taskId]), reason: [reason]\n

run then calls setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task has been killed (with the taskId, TaskState.KILLED, and a serialized TaskKilled object).","text":""},{"location":"executor/TaskRunner/#interruptedexception-with-task-killed","title":"InterruptedException (with Task Killed)

                                                                          When InterruptedException is reported while running a task, and the task has been killed, you should see the following INFO message in the logs:

                                                                          Executor interrupted and killed [taskName] (TID [taskId]), reason: [killReason]\n

run then calls setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task has been killed (with the taskId, TaskState.KILLED, and a serialized TaskKilled object).

Note

The only difference between this InterruptedException case and the TaskKilledException case is the INFO message in the logs.","text":""},{"location":"executor/TaskRunner/#commitdeniedexception","title":"CommitDeniedException

When CommitDeniedException is reported while running a task, run calls setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and a serialized TaskFailedReason).

Note

The difference between this CommitDeniedException case and the FetchFailedException case is just the reason being sent to ExecutorBackend.","text":""},{"location":"executor/TaskRunner/#throwable","title":"Throwable

                                                                          When run catches a Throwable, you should see the following ERROR message in the logs (followed by the exception).

                                                                          Exception in [taskName] (TID [taskId])\n

run then records the following task metrics (only when the Task is available):

• executorRunTime
• jvmGCTime

run then collects the latest values of internal and external accumulators (with the taskFailed flag enabled to indicate that the collection is for a failed task).

Otherwise, when the Task is not available, the accumulator collection is empty.

run converts the task accumulators to a collection of AccumulableInfo, creates an ExceptionFailure (with the accumulators), and serializes it.

Note

run uses a closure Serializer to serialize the ExceptionFailure.

                                                                          CAUTION: FIXME Why does run create new ExceptionFailure(t, accUpdates).withAccums(accums), i.e. accumulators occur twice in the object.

run calls setTaskFinishedAndClearInterruptStatus and notifies ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and the serialized ExceptionFailure).

                                                                          run may also trigger SparkUncaughtExceptionHandler.uncaughtException(t) if this is a fatal error.

Note

The difference between this Throwable case and the other FAILED cases (i.e. FetchFailedException and CommitDeniedException) is the serialized ExceptionFailure vs a serialized reason being sent to ExecutorBackend, respectively.","text":""},{"location":"executor/TaskRunner/#collectaccumulatorsandresetstatusonfailure","title":"collectAccumulatorsAndResetStatusOnFailure

                                                                          collectAccumulatorsAndResetStatusOnFailure(\n  taskStartTimeNs: Long)\n

                                                                          collectAccumulatorsAndResetStatusOnFailure...FIXME

                                                                          ","text":""},{"location":"executor/TaskRunner/#killing-task","title":"Killing Task
                                                                          kill(\n  interruptThread: Boolean,\n  reason: String): Unit\n

kill marks the TaskRunner as killed (recording the reasonIfKilled) and kills the task (if available and not finished already).

Note

kill passes the input interruptThread on to the task itself while killing it.

                                                                          When executed, you should see the following INFO message in the logs:

                                                                          Executor is trying to kill [taskName] (TID [taskId]), reason: [reason]\n

Note

The kill reason (reasonIfKilled) is checked periodically while the task runs to stop executing the task. Once killed, the task will eventually stop.","text":""},{"location":"executor/TaskRunner/#logging","title":"Logging

                                                                          Enable ALL logging level for org.apache.spark.executor.Executor logger to see what happens inside.

                                                                          Add the following line to conf/log4j.properties:

                                                                          log4j.logger.org.apache.spark.executor.Executor=ALL\n

                                                                          Refer to Logging.
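If your Spark distribution ships with Log4j 2 instead (conf/log4j2.properties), the equivalent configuration would be along these lines (assuming standard Log4j 2 properties syntax):

logger.executor.name = org.apache.spark.executor.Executor
logger.executor.level = all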

                                                                          ","text":""},{"location":"executor/TaskRunner/#internal-properties","title":"Internal Properties","text":""},{"location":"executor/TaskRunner/#finished-flag","title":"finished Flag

finished flag says whether the task has finished (true) or not (false)

Default: false

Enabled (true) after TaskRunner has been requested to setTaskFinishedAndClearInterruptStatus

Used when TaskRunner is requested to kill the task","text":""},{"location":"executor/TaskRunner/#reasonifkilled","title":"reasonIfKilled

Reason to kill the task (and avoid running it)

                                                                          Default: (empty) (None)

                                                                          ","text":""},{"location":"executor/TaskRunner/#startgctime-timestamp","title":"startGCTime Timestamp

Timestamp (really the total amount of time this Executor JVM process has already spent in garbage collection) that is used to mark the GC \"zero\" time (when the TaskRunner starts to run) and then compute the JVM GC time metric when:

• TaskRunner is requested to run the task and collectAccumulatorsAndResetStatusOnFailure

• Executor is requested to reportHeartBeat

                                                                          • ","text":""},{"location":"executor/TaskRunner/#task","title":"Task

Deserialized task to execute

Used when:

• TaskRunner is requested to run, kill, and collectAccumulatorsAndResetStatusOnFailure, among others

• Executor is requested to reportHeartBeat

                                                                            • ","text":""},{"location":"executor/TaskRunner/#task-name","title":"Task Name

The name of the task (of the TaskDescription) that is used exclusively for logging purposes when TaskRunner is requested to run and kill the task","text":""},{"location":"executor/TaskRunner/#thread-id","title":"Thread Id

                                                                              Current thread ID

                                                                              Default: -1

Set immediately when TaskRunner is requested to run and used exclusively when TaskReaper is requested for the thread info of the current thread (aka thread dump)","text":""},{"location":"exercises/spark-examples-wordcount-spark-shell/","title":"WordCount using Spark shell","text":"

                                                                              == WordCount using Spark shell

Like any introductory big data example, this one demonstrates how to count words in a distributed fashion.

In the following example you're going to count the words in the README.md file that sits in your Spark distribution and save the result under the README.count directory.

You're going to use the Spark shell for the example. Execute spark-shell.

                                                                              "},{"location":"exercises/spark-examples-wordcount-spark-shell/#sourcescala","title":"[source,scala]","text":"

                                                                              val lines = sc.textFile(\"README.md\") // <1>

                                                                              val words = lines.flatMap(_.split(\"\\s+\")) // <2>

                                                                              val wc = words.map(w => (w, 1)).reduceByKey(_ + _) // <3>

                                                                              "},{"location":"exercises/spark-examples-wordcount-spark-shell/#wcsaveastextfilereadmecount-4","title":"wc.saveAsTextFile(\"README.count\") // <4>","text":"

<1> Read the text file (refer to Using Input and Output (I/O)).
<2> Split each line into words and flatten the result.
<3> Map each word into a pair and count them by word (key).
<4> Save the result into text files - one per partition.
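Putting the four steps together, the whole example can be pasted into spark-shell as one block:

val lines = sc.textFile(\"README.md\")
val words = lines.flatMap(_.split(\"\\s+\"))
val wc = words.map(w => (w, 1)).reduceByKey(_ + _)
wc.saveAsTextFile(\"README.count\")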

                                                                              After you have executed the example, see the contents of the README.count directory:

                                                                              $ ls -lt README.count\ntotal 16\n-rw-r--r--  1 jacek  staff     0  9 pa\u017a 13:36 _SUCCESS\n-rw-r--r--  1 jacek  staff  1963  9 pa\u017a 13:36 part-00000\n-rw-r--r--  1 jacek  staff  1663  9 pa\u017a 13:36 part-00001\n

                                                                              The files part-0000x contain the pairs of word and the count.

                                                                              $ cat README.count/part-00000\n(package,1)\n(this,1)\n(Version\"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1)\n(Because,1)\n(Python,2)\n(cluster.,1)\n(its,1)\n([run,1)\n...\n

                                                                              === Further (self-)development

                                                                              Please read the questions and give answers first before looking at the link given.

                                                                              1. Why are there two files under the directory?
                                                                              2. How could you have only one?
                                                                              3. How to filter out words by name?
                                                                              4. How to count words?

Please refer to the Partitions chapter to find some of the answers. A possible approach to questions 2 and 3 is sketched below.
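For instance, assuming the words and wc values from the example above, you could write a single output file and filter words by name as follows (the output path and the filtered word are illustrative):

// Question 2: a single part file - reduce the number of partitions to 1 before saving
wc.coalesce(1).saveAsTextFile(\"README.count.single\")

// Question 3: filter out a word by name before counting
val wcNoThe = words.filter(_ != \"the\").map(w => (w, 1)).reduceByKey(_ + _)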

                                                                              "},{"location":"exercises/spark-exercise-custom-scheduler-listener/","title":"Developing Custom SparkListener to monitor DAGScheduler in Scala","text":"

                                                                              == Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala

The example shows how to develop a custom Spark listener. You should read the SparkListener page first to understand the motivation for the example.

                                                                              === Requirements

1. IntelliJ IDEA (https://www.jetbrains.com/idea/), or just sbt (http://www.scala-sbt.org/) alone if you're adventurous.
2. Access to the Internet to download Apache Spark's dependencies.

                                                                              === Setting up Scala project using IntelliJ IDEA

                                                                              Create a new project custom-spark-listener.

Add the following line to build.sbt (the main configuration file for the sbt project) to add the dependency on Apache Spark.

                                                                              libraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"2.0.1\"\n

                                                                              build.sbt should look as follows:

                                                                              "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#source-scala","title":"[source, scala]","text":"

                                                                              name := \"custom-spark-listener\" organization := \"pl.jaceklaskowski.spark\" version := \"1.0\"

                                                                              scalaVersion := \"2.11.8\"

                                                                              "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#librarydependencies-orgapachespark-spark-core-201","title":"libraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"2.0.1\"","text":"

                                                                              === Custom Listener - pl.jaceklaskowski.spark.CustomSparkListener

                                                                              Create a Scala class -- CustomSparkListener -- for your custom SparkListener. It should be under src/main/scala directory (create one if it does not exist).

                                                                              The aim of the class is to intercept scheduler events about jobs being started and tasks completed.

                                                                              "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#sourcescala","title":"[source,scala]","text":"

                                                                              package pl.jaceklaskowski.spark

                                                                              import org.apache.spark.scheduler.{SparkListenerStageCompleted, SparkListener, SparkListenerJobStart}

class CustomSparkListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s\"Job started with ${jobStart.stageInfos.size} stages: $jobStart\")
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println(s\"Stage ${stageCompleted.stageInfo.stageId} completed with ${stageCompleted.stageInfo.numTasks} tasks.\")
  }
}

                                                                              === Creating deployable package

                                                                              Package the custom Spark listener. Execute sbt package command in the custom-spark-listener project's main directory.

                                                                              $ sbt package\n[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins\n[info] Loading project definition from /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project\n[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project/}custom-spark-listener-build...\n[info] Resolving org.fusesource.jansi#jansi;1.4 ...\n[info] Done updating.\n[info] Set current project to custom-spark-listener (in build file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/)\n[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/}custom-spark-listener...\n[info] Resolving jline#jline;2.12.1 ...\n[info] Done updating.\n[info] Compiling 1 Scala source to /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/classes...\n[info] Packaging /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/custom-spark-listener_2.11-1.0.jar ...\n[info] Done packaging.\n[success] Total time: 8 s, completed Oct 27, 2016 11:23:50 AM\n

                                                                              You should find the result jar file with the custom scheduler listener ready under target/scala-2.11 directory, e.g. target/scala-2.11/custom-spark-listener_2.11-1.0.jar.

                                                                              === Activating Custom Listener in Spark shell

Start spark-shell with additional configurations for the extra custom listener and the jar that includes the class.

                                                                              $ spark-shell \\\n  --conf spark.logConf=true \\\n  --conf spark.extraListeners=pl.jaceklaskowski.spark.CustomSparkListener \\\n  --driver-class-path target/scala-2.11/custom-spark-listener_2.11-1.0.jar\n

Create a Dataset and execute an action (like count) to start a job as follows:

                                                                              scala> spark.read.text(\"README.md\").count\n[CustomSparkListener] Job started with 2 stages: SparkListenerJobStart(1,1473946006715,WrappedArray(org.apache.spark.scheduler.StageInfo@71515592, org.apache.spark.scheduler.StageInfo@6852819d),{spark.rdd.scope.noOverride=true, spark.rdd.scope={\"id\":\"14\",\"name\":\"collect\"}, spark.sql.execution.id=2})\n[CustomSparkListener] Stage 1 completed with 1 tasks.\n[CustomSparkListener] Stage 2 completed with 1 tasks.\nres0: Long = 7\n

                                                                              The lines with [CustomSparkListener] came from your custom Spark listener. Congratulations! The exercise's over.

                                                                              === BONUS Activating Custom Listener in Spark Application

TIP: Read Registering SparkListener (SparkContext.addSparkListener).
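A minimal sketch of registering the listener programmatically inside a Spark application (the app name and master are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(\"custom-listener-demo\").setMaster(\"local[*]\")
val sc = new SparkContext(conf)

// SparkContext.addSparkListener is a DeveloperApi
sc.addSparkListener(new pl.jaceklaskowski.spark.CustomSparkListener)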

                                                                              === Questions

                                                                              1. What are the pros and cons of using the command line version vs inside a Spark application?
                                                                              "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/","title":"Working with Datasets from JDBC Data Sources (and PostgreSQL)","text":"

                                                                              == Working with Datasets from JDBC Data Sources (and PostgreSQL)

                                                                              Start spark-shell with the JDBC driver for the database you want to use. In our case, it is PostgreSQL JDBC Driver.

NOTE: Download the jar for PostgreSQL JDBC Driver 42.1.1 directly from the Maven repository: http://central.maven.org/maven2/org/postgresql/postgresql/42.1.1/postgresql-42.1.1.jar

                                                                              "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#tip","title":"[TIP]","text":"

                                                                              Execute the command to have the jar downloaded into ~/.ivy2/jars directory by spark-shell itself:

                                                                              ./bin/spark-shell --packages org.postgresql:postgresql:42.1.1\n

                                                                              The entire path to the driver file is then like /Users/jacek/.ivy2/jars/org.postgresql_postgresql-42.1.1.jar.

                                                                              You should see the following while spark-shell downloads the driver.

                                                                              "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#ivy-default-cache-set-to-usersjacekivy2cache-the-jars-for-the-packages-stored-in-usersjacekivy2jars-loading-settings-url-jarfileusersjacekdevosssparkassemblytargetscala-211jarsivy-240jarorgapacheivycoresettingsivysettingsxml-orgpostgresqlpostgresql-added-as-a-dependency-resolving-dependencies-orgapachesparkspark-submit-parent10-confs-default-found-orgpostgresqlpostgresql4211-in-central-downloading-httpsrepo1mavenorgmaven2orgpostgresqlpostgresql4211postgresql-4211jar-successful-orgpostgresqlpostgresql4211postgresqljarbundle-205ms-resolution-report-resolve-1887ms-artifacts-dl-207ms-modules-in-use-orgpostgresqlpostgresql4211-from-central-in-default-modules-artifacts-conf-number-searchdwnldedevicted-numberdwnlded-default-1-1-1-0-1-1-retrieving-orgapachesparkspark-submit-parent-confs-default-1-artifacts-copied-0-already-retrieved-695kb8ms","title":"
                                                                              Ivy Default Cache set to: /Users/jacek/.ivy2/cache\nThe jars for the packages stored in: /Users/jacek/.ivy2/jars\n:: loading settings :: url = jar:file:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\norg.postgresql#postgresql added as a dependency\n:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0\n    confs: [default]\n    found org.postgresql#postgresql;42.1.1 in central\ndownloading https://repo1.maven.org/maven2/org/postgresql/postgresql/42.1.1/postgresql-42.1.1.jar ...\n    [SUCCESSFUL ] org.postgresql#postgresql;42.1.1!postgresql.jar(bundle) (205ms)\n:: resolution report :: resolve 1887ms :: artifacts dl 207ms\n    :: modules in use:\n    org.postgresql#postgresql;42.1.1 from central in [default]\n    ---------------------------------------------------------------------\n    |                  |            modules            ||   artifacts   |\n    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|\n    ---------------------------------------------------------------------\n    |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |\n    ---------------------------------------------------------------------\n:: retrieving :: org.apache.spark#spark-submit-parent\n    confs: [default]\n    1 artifacts copied, 0 already retrieved (695kB/8ms)\n
                                                                              ","text":"

Start ./bin/spark-shell with the --driver-class-path command line option and the driver jar.

                                                                              SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell --driver-class-path /Users/jacek/.ivy2/jars/org.postgresql_postgresql-42.1.1.jar\n

                                                                              It will give you the proper setup for accessing PostgreSQL using the JDBC driver.

                                                                              Execute the following to access projects table in sparkdb.

                                                                              "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#source-scala","title":"[source, scala]","text":"

// that gives a one-partition Dataset
val opts = Map(
  \"url\" -> \"jdbc:postgresql:sparkdb\",
  \"dbtable\" -> \"projects\")
val df = spark.
  read.
  format(\"jdbc\").
  options(opts).
  load

                                                                              NOTE: Use user and password options to specify the credentials if needed.
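For example (with placeholder credentials):

val optsWithAuth = opts ++ Map(
  \"user\" -> \"jacek\",
  \"password\" -> \"secret\")
val dfWithAuth = spark.read.format(\"jdbc\").options(optsWithAuth).load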

                                                                              "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#source-scala_1","title":"[source, scala]","text":"

// Note the number of partitions (aka numPartitions)
scala> df.explain
== Physical Plan ==
*Scan JDBCRelation(projects) [numPartitions=1] [id#0,name#1,website#2] ReadSchema: struct

scala> df.show(truncate = false)
+---+------------+-----------------------+
|id |name        |website                |
+---+------------+-----------------------+
|1  |Apache Spark|http://spark.apache.org|
|2  |Apache Hive |http://hive.apache.org |
|3  |Apache Kafka|http://kafka.apache.org|
|4  |Apache Flink|http://flink.apache.org|
+---+------------+-----------------------+

// use the jdbc method with predicates to define partitions
import java.util.Properties
val df4parts = spark.
  read.
  jdbc(
    url = \"jdbc:postgresql:sparkdb\",
    table = \"projects\",
    predicates = Array(\"id=1\", \"id=2\", \"id=3\", \"id=4\"),
    connectionProperties = new Properties())

// Note the number of partitions (aka numPartitions)
scala> df4parts.explain
== Physical Plan ==
*Scan JDBCRelation(projects) [numPartitions=4] [id#16,name#17,website#18] ReadSchema: struct

scala> df4parts.show(truncate = false)
+---+------------+-----------------------+
|id |name        |website                |
+---+------------+-----------------------+
|1  |Apache Spark|http://spark.apache.org|
|2  |Apache Hive |http://hive.apache.org |
|3  |Apache Kafka|http://kafka.apache.org|
|4  |Apache Flink|http://flink.apache.org|
+---+------------+-----------------------+

                                                                              === Troubleshooting

If things can go wrong, sooner or later they will. Here is a list of possible issues and their solutions.

                                                                              ==== java.sql.SQLException: No suitable driver

Ensure that the JDBC driver sits on the CLASSPATH. Use --driver-class-path as described above (--packages or --jars do not work).

                                                                              scala> val df = spark.\n     |   read.\n     |   format(\"jdbc\").\n     |   options(opts).\n     |   load\njava.sql.SQLException: No suitable driver\n  at java.sql.DriverManager.getDriver(DriverManager.java:315)\n  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)\n  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)\n  at scala.Option.getOrElse(Option.scala:121)\n  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)\n  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)\n  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)\n  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:301)\n  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)\n  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)\n  ... 52 elided\n

                                                                              === PostgreSQL Setup

                                                                              NOTE: I'm on Mac OS X so YMMV (aka Your Mileage May Vary).

Use the following sections to have a properly configured PostgreSQL database.

• Installation
• Starting Database Server
• Create Database
• Accessing Database
• Creating Table
• Dropping Database
• Stopping Database Server

==== Installation

                                                                                Install PostgreSQL as described in...TK

CAUTION: This page serves as a cheatsheet for the author so he does not have to search the Internet to find the installation steps.

                                                                                $ initdb /usr/local/var/postgres -E utf8\nThe files belonging to this database system will be owned by user \"jacek\".\nThis user must also own the server process.\n\nThe database cluster will be initialized with locale \"pl_pl.utf-8\".\ninitdb: could not find suitable text search configuration for locale \"pl_pl.utf-8\"\nThe default text search configuration will be set to \"simple\".\n\nData page checksums are disabled.\n\ncreating directory /usr/local/var/postgres ... ok\ncreating subdirectories ... ok\nselecting default max_connections ... 100\nselecting default shared_buffers ... 128MB\nselecting dynamic shared memory implementation ... posix\ncreating configuration files ... ok\ncreating template1 database in /usr/local/var/postgres/base/1 ... ok\ninitializing pg_authid ... ok\ninitializing dependencies ... ok\ncreating system views ... ok\nloading system objects' descriptions ... ok\ncreating collations ... ok\ncreating conversions ... ok\ncreating dictionaries ... ok\nsetting privileges on built-in objects ... ok\ncreating information schema ... ok\nloading PL/pgSQL server-side language ... ok\nvacuuming database template1 ... ok\ncopying template1 to template0 ... ok\ncopying template1 to postgres ... ok\nsyncing data to disk ... ok\n\nWARNING: enabling \"trust\" authentication for local connections\nYou can change this by editing pg_hba.conf or using the option -A, or\n--auth-local and --auth-host, the next time you run initdb.\n\nSuccess. You can now start the database server using:\n\n    pg_ctl -D /usr/local/var/postgres -l logfile start\n

==== Starting Database Server

NOTE: Consult 17.3. Starting the Database Server (http://www.postgresql.org/docs/current/static/server-start.html) in the official documentation.

                                                                                "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#tip_1","title":"[TIP]","text":"

                                                                                Enable all logs in PostgreSQL to see query statements.

                                                                                log_statement = 'all'\n
                                                                                "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#add-log_statement-all-to-usrlocalvarpostgrespostgresqlconf-on-mac-os-x-with-postgresql-installed-using-brew","title":"Add log_statement = 'all' to /usr/local/var/postgres/postgresql.conf on Mac OS X with PostgreSQL installed using brew.","text":"

                                                                                Start the database server using pg_ctl.

                                                                                $ pg_ctl -D /usr/local/var/postgres -l logfile start\nserver starting\n

                                                                                Alternatively, you can run the database server using postgres.

                                                                                $ postgres -D /usr/local/var/postgres\n

==== Create Database

                                                                                $ createdb sparkdb\n

TIP: Consult createdb (http://www.postgresql.org/docs/current/static/app-createdb.html) in the official documentation.

                                                                                ==== Accessing Database

                                                                                Use psql sparkdb to access the database.

                                                                                $ psql sparkdb\npsql (9.6.2)\nType \"help\" for help.\n\nsparkdb=#\n

                                                                                Execute SELECT version() to know the version of the database server you have connected to.

                                                                                sparkdb=# SELECT version();\n                                                   version\n--------------------------------------------------------------------------------------------------------------\n PostgreSQL 9.6.2 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit\n(1 row)\n

                                                                                Use \\h for help and \\q to leave a session.

                                                                                ==== Creating Table

                                                                                Create a table using CREATE TABLE command.

                                                                                CREATE TABLE projects (\n  id SERIAL PRIMARY KEY,\n  name text,\n  website text\n);\n

                                                                                Insert rows to initialize the table with data.

                                                                                INSERT INTO projects (name, website) VALUES ('Apache Spark', 'http://spark.apache.org');\nINSERT INTO projects (name, website) VALUES ('Apache Hive', 'http://hive.apache.org');\nINSERT INTO projects VALUES (DEFAULT, 'Apache Kafka', 'http://kafka.apache.org');\nINSERT INTO projects VALUES (DEFAULT, 'Apache Flink', 'http://flink.apache.org');\n

                                                                                Execute select * from projects; to ensure that you have the following records in projects table:

                                                                                sparkdb=# select * from projects;\n id |     name     |         website\n----+--------------+-------------------------\n  1 | Apache Spark | http://spark.apache.org\n  2 | Apache Hive  | http://hive.apache.org\n  3 | Apache Kafka | http://kafka.apache.org\n  4 | Apache Flink | http://flink.apache.org\n(4 rows)\n

                                                                                ==== Dropping Database

                                                                                $ dropdb sparkdb\n

TIP: Consult dropdb (http://www.postgresql.org/docs/current/static/app-dropdb.html) in the official documentation.

                                                                                ==== Stopping Database Server

                                                                                pg_ctl -D /usr/local/var/postgres stop\n
                                                                                "},{"location":"exercises/spark-exercise-failing-stage/","title":"Causing Stage to Fail","text":"

                                                                                == Exercise: Causing Stage to Fail

                                                                                The example shows how Spark re-executes a stage in case of stage failure.

                                                                                === Recipe

                                                                                Start a Spark cluster, e.g. 1-node Hadoop YARN.

                                                                                start-yarn.sh\n
                                                                                // 2-stage job -- it _appears_ that a stage can be failed only when there is a shuffle\nsc.parallelize(0 to 3e3.toInt, 2).map(n => (n % 2, n)).groupByKey.count\n

Use at least 2 executors so you can kill one and keep the application up and running (on the other executor).

                                                                                YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn \\\n  -c spark.shuffle.service.enabled=true \\\n  --num-executors 2\n
                                                                                "},{"location":"exercises/spark-exercise-pairrddfunctions-oneliners/","title":"One-liners using PairRDDFunctions","text":"

                                                                                == Exercise: One-liners using PairRDDFunctions

This is a set of one-liners to give you an entry point into using PairRDDFunctions.

                                                                                === Exercise

How would you go about solving a requirement to pair elements of the same key and create a new RDD out of the matched values? One possible solution is sketched after the desired output below.

                                                                                "},{"location":"exercises/spark-exercise-pairrddfunctions-oneliners/#source-scala","title":"[source, scala]","text":"

                                                                                val users = Seq((1, \"user1\"), (1, \"user2\"), (2, \"user1\"), (2, \"user3\"), (3,\"user2\"), (3,\"user4\"), (3,\"user1\"))

// Input RDD
val us = sc.parallelize(users)

                                                                                // ...your code here

// Desired output
// Seq((\"user1\",\"user2\"), (\"user1\",\"user3\"), (\"user1\",\"user4\"), (\"user2\",\"user4\"))
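One possible one-liner (out of many) is a self-join on the key followed by de-duplication; this is only a sketch of an approach, not the single right answer:

val pairs = us.join(us).values.filter { case (a, b) => a < b }.distinct

// pairs.collect.sorted
// => Array((user1,user2), (user1,user3), (user1,user4), (user2,user4))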

                                                                                "},{"location":"exercises/spark-exercise-standalone-master-ha/","title":"Spark Standalone - Using ZooKeeper for High-Availability of Master","text":"

                                                                                == Spark Standalone - Using ZooKeeper for High-Availability of Master

TIP: Read ../spark-standalone-Master.md#recovery-mode[Recovery Mode] to learn the theory first.

                                                                                You're going to start two standalone Masters.

                                                                                You'll need 4 terminals (adjust addresses as needed):

                                                                                Start ZooKeeper.

                                                                                Create a configuration file ha.conf with the content as follows:

                                                                                spark.deploy.recoveryMode=ZOOKEEPER\nspark.deploy.zookeeper.url=<zookeeper_host>:2181\nspark.deploy.zookeeper.dir=/spark\n

                                                                                Start the first standalone Master.

                                                                                ./sbin/start-master.sh -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf\n

                                                                                Start the second standalone Master.

                                                                                NOTE: It is not possible to start another instance of standalone Master on the same machine using ./sbin/start-master.sh. The reason is that the script assumes one instance per machine only. We're going to change the script to make it possible.

                                                                                $ cp ./sbin/start-master{,-2}.sh\n\n$ grep \"CLASS 1\" ./sbin/start-master-2.sh\n\"$\\{SPARK_HOME}/sbin\"/spark-daemon.sh start $CLASS 1 \\\n\n$ sed -i -e 's/CLASS 1/CLASS 2/' sbin/start-master-2.sh\n\n$ grep \"CLASS 1\" ./sbin/start-master-2.sh\n\n$ grep \"CLASS 2\" ./sbin/start-master-2.sh\n\"$\\{SPARK_HOME}/sbin\"/spark-daemon.sh start $CLASS 2 \\\n\n$ ./sbin/start-master-2.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf\n

You can check how many instances you're currently running using the jps command as follows:

                                                                                $ jps -lm\n5024 sun.tools.jps.Jps -lm\n4994 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf\n4808 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf\n4778 org.apache.zookeeper.server.quorum.QuorumPeerMain config/zookeeper.properties\n

                                                                                Start a standalone Worker.

                                                                                ./sbin/start-slave.sh spark://localhost:7077,localhost:17077\n

                                                                                Start Spark shell.

                                                                                ./bin/spark-shell --master spark://localhost:7077,localhost:17077\n

                                                                                Wait till the Spark shell connects to an active standalone Master.

                                                                                Find out which standalone Master is active (there can only be one). Kill it. Observe how the other standalone Master takes over and lets the Spark shell register with itself. Check out the master's UI.

Optionally, kill the worker and make sure it disappears instantly from the active master's logs.

                                                                                "},{"location":"exercises/spark-exercise-take-multiple-jobs/","title":"Learning Jobs and Partitions Using take Action","text":"

                                                                                == Exercise: Learning Jobs and Partitions Using take Action

The exercise aims to introduce the take action, spark-shell, and the web UI. It should introduce you to the concepts of partitions and jobs.

                                                                                The following snippet creates an RDD of 16 elements with 16 partitions.

                                                                                scala> val r1 = sc.parallelize(0 to 15, 16)\nr1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:18\n\nscala> r1.partitions.size\nres63: Int = 16\n\nscala> r1.foreachPartition(it => println(\">>> partition size: \" + it.size))\n...\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n... // the machine has 8 cores\n... // so first 8 tasks get executed immediately\n... // with the others after a core is free to take on new tasks.\n>>> partition size: 1\n...\n>>> partition size: 1\n...\n>>> partition size: 1\n...\n>>> partition size: 1\n>>> partition size: 1\n...\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n

                                                                                All 16 partitions have one element.

                                                                                When you execute r1.take(1) only one job gets run since it is enough to compute one task on one partition.

                                                                                CAUTION: FIXME Snapshot from web UI - note the number of tasks

However, when you execute r1.take(2), two jobs get run: the implementation first assumes that one job over one partition is enough and, when the elements collected do not add up to the number requested in take, quadruples the number of partitions to work on in the following jobs.

                                                                                CAUTION: FIXME Snapshot from web UI - note the number of tasks

                                                                                Can you guess how many jobs are run for r1.take(15)? How many tasks per job?

                                                                                CAUTION: FIXME Snapshot from web UI - note the number of tasks

                                                                                Answer: 3.
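The growth in the number of scanned partitions can be modelled with a short sketch (a simplified illustration assuming a scale-up factor of 4 and one element per partition; this is not Spark's actual implementation):

// model how many jobs take(n) runs over an RDD with totalPartitions partitions\ndef jobsForTake(requested: Int, totalPartitions: Int, elementsPerPartition: Int): Int = {\n  var collected = 0   // elements gathered so far\n  var scanned = 0     // partitions scanned so far\n  var partsToTry = 1  // the first job scans a single partition\n  var jobs = 0\n  while (collected < requested && scanned < totalPartitions) {\n    val parts = math.min(partsToTry, totalPartitions - scanned)\n    collected += parts * elementsPerPartition\n    scanned += parts\n    partsToTry *= 4   // quadruple the partitions for the next job\n    jobs += 1\n  }\n  jobs\n}\n\njobsForTake(15, 16, 1)  // 3\n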

                                                                                "},{"location":"exercises/spark-first-app/","title":"Your first complete Spark application (using Scala and sbt)","text":"

                                                                                == Your first Spark application (using Scala and sbt)

This page gives you the exact steps to develop and run a complete Spark application using the http://www.scala-lang.org/[Scala] programming language and http://www.scala-sbt.org/[sbt] as the build tool.

                                                                                [TIP] Refer to Quick Start's http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/quick-start.html#self-contained-applications[Self-Contained Applications] in the official documentation.

                                                                                The sample application called SparkMe App is...FIXME

                                                                                === Overview

                                                                                You're going to use http://www.scala-sbt.org/[sbt] as the project build tool. It uses build.sbt for the project's description as well as the dependencies, i.e. the version of Apache Spark and others.

                                                                                The application's main code is under src/main/scala directory, in SparkMeApp.scala file.

                                                                                With the files in a directory, executing sbt package results in a package that can be deployed onto a Spark cluster using spark-submit.

                                                                                In this example, you're going to use Spark's local/spark-local.md[local mode].

                                                                                === Project's build - build.sbt

                                                                                Any Scala project managed by sbt uses build.sbt as the central place for configuration, including project dependencies denoted as libraryDependencies.

                                                                                build.sbt

                                                                                name         := \"SparkMe Project\"\nversion      := \"1.0\"\norganization := \"pl.japila\"\n\nscalaVersion := \"2.11.7\"\n\nlibraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"1.6.0-SNAPSHOT\"  // <1>\nresolvers += Resolver.mavenLocal\n
                                                                                <1> Use the development version of Spark 1.6.0-SNAPSHOT

                                                                                === SparkMe Application

                                                                                The application uses a single command-line parameter (as args(0)) that is the file to process. The file is read and the number of lines printed out.

                                                                                package pl.japila.spark\n\nimport org.apache.spark.{SparkContext, SparkConf}\n\nobject SparkMeApp {\n  def main(args: Array[String]) {\n    val conf = new SparkConf().setAppName(\"SparkMe Application\")\n    val sc = new SparkContext(conf)\n\n    val fileName = args(0)\n    val lines = sc.textFile(fileName).cache\n\n    val c = lines.count\n    println(s\"There are $c lines in $fileName\")\n  }\n}\n

                                                                                === sbt version - project/build.properties

sbt (the launcher) uses the project/build.properties file to set up (the real) sbt.

                                                                                sbt.version=0.13.9\n

TIP: With this file the build is more predictable, as the version of sbt doesn't depend on the sbt launcher.

                                                                                === Packaging Application

                                                                                Execute sbt package to package the application.

                                                                                \u279c  sparkme-app  sbt package\n[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins\n[info] Loading project definition from /Users/jacek/dev/sandbox/sparkme-app/project\n[info] Set current project to SparkMe Project (in build file:/Users/jacek/dev/sandbox/sparkme-app/)\n[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/classes...\n[info] Packaging /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/sparkme-project_2.11-1.0.jar ...\n[info] Done packaging.\n[success] Total time: 3 s, completed Sep 23, 2015 12:47:52 AM\n

The application uses only classes that come with Spark, so sbt package is enough.

The final application, ready for deployment, is in target/scala-2.11/sparkme-project_2.11-1.0.jar.

                                                                                === Submitting Application to Spark (local)

                                                                                NOTE: The application is going to be deployed to local[*]. Change it to whatever cluster you have available (refer to spark-cluster.md[Running Spark in cluster]).

spark-submit the SparkMe application and specify the file to process (it is the only, and required, input parameter of the application), e.g. the project's own build.sbt.

                                                                                NOTE: build.sbt is sbt's build definition and is only used as an input file for demonstration purposes. Any file is going to work fine.

                                                                                \u279c  sparkme-app  ~/dev/oss/spark/bin/spark-submit --master \"local[*]\" --class pl.japila.spark.SparkMeApp target/scala-2.11/sparkme-project_2.11-1.0.jar build.sbt\nUsing Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties\nTo adjust logging level use sc.setLogLevel(\"INFO\")\n15/09/23 01:06:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n15/09/23 01:06:04 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.\nThere are 8 lines in build.sbt\n

                                                                                NOTE: Disregard the two above WARN log messages.

                                                                                You're done. Sincere congratulations!

                                                                                "},{"location":"exercises/spark-hello-world-using-spark-shell/","title":"Spark's Hello World using Spark shell and Scala","text":"

                                                                                == Exercise: Spark's Hello World using Spark shell and Scala

Run the Spark shell and count the number of words in a file using the MapReduce pattern (a sketch follows the steps below).

                                                                                • Use sc.textFile to read the file into memory
                                                                                • Use RDD.flatMap for a mapper step
                                                                                • Use reduceByKey for a reducer step
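A minimal sketch putting the steps together (it assumes a README.md file in the current directory; any text file works):

// read the file, split lines into words and count every word (MapReduce pattern)\nval lines = sc.textFile(\"README.md\")\nval counts = lines\n  .flatMap(_.split(\" \"))   // mapper: one record per word\n  .filter(_.nonEmpty)\n  .map(word => (word, 1))\n  .reduceByKey(_ + _)       // reducer: sum the counts per word\ncounts.take(10).foreach(println)\n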
                                                                                "},{"location":"exercises/spark-sql-hive-orc-example/","title":"Using Spark SQL to update data in Hive using ORC files","text":"

                                                                                == Using Spark SQL to update data in Hive using ORC files

The example showed up on Spark's users mailing list.

                                                                                "},{"location":"exercises/spark-sql-hive-orc-example/#caution","title":"[CAUTION]","text":"
                                                                                • FIXME Offer a complete working solution in Scala
• FIXME Load ORC files into a DataFrame, e.g. val df = hiveContext.read.format(\"orc\").load(to/path)

The solution was to use Hive tables stored in ORC format with partitions (a sketch follows the list below):

                                                                                • A table in Hive stored as an ORC file (using partitioning)
                                                                                • Using SQLContext.sql to insert data into the table
• Using SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge the many small files into larger files optimized for your HDFS block size. Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing.
• The Hive solution is simply to concatenate the files; it does not alter or change records. It is possible to update data in Hive using the ORC format. With transactional tables in Hive (together with insert, update and delete), Hive does the concatenation for you automatically at regular intervals. Currently this works only with tables stored as ORC. Alternatively, use HBase with Phoenix as the SQL layer on top. Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent versions can do updates, deletes, etc. in a transactional way.
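A minimal sketch of the approach (the table name, columns and partition values are hypothetical; whether CONCATENATE can be issued through SQLContext.sql depends on the Hive and Spark versions in use):

// create a partitioned, ORC-backed Hive table (names are made up for illustration)\nhiveContext.sql(\"CREATE TABLE IF NOT EXISTS events (id STRING, payload STRING) PARTITIONED BY (event_date STRING) STORED AS ORC\")\n\n// append a (small) batch of incoming events into a partition\nhiveContext.sql(\"INSERT INTO TABLE events PARTITION (event_date = '2016-01-01') SELECT id, payload FROM incoming_events\")\n\n// periodically merge the many small ORC files of the partition into larger ones\nhiveContext.sql(\"ALTER TABLE events PARTITION (event_date = '2016-01-01') CONCATENATE\")\n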

                                                                                Criteria:

                                                                                • spark-streaming/spark-streaming.md[Spark Streaming] jobs are receiving a lot of small events (avg 10kb)
                                                                                • Events are stored to HDFS, e.g. for Pig jobs
                                                                                • There are a lot of small files in HDFS (several millions)
                                                                                "},{"location":"external-shuffle-service/","title":"External Shuffle Service","text":"

External Shuffle Service is a Spark service that serves RDD and shuffle blocks to executors from outside of the executors themselves.

                                                                                ExternalShuffleService can be started as a command-line application or automatically as part of a worker node in a Spark cluster (e.g. Spark Standalone).

                                                                                External Shuffle Service is enabled in a Spark application using spark.shuffle.service.enabled configuration property.
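For example, a spark-shell invocation with the External Shuffle Service enabled (the dynamic allocation flag is optional and shown only because the two are commonly used together):

./bin/spark-shell \\\n  --conf spark.shuffle.service.enabled=true \\\n  --conf spark.dynamicAllocation.enabled=true\n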

                                                                                "},{"location":"external-shuffle-service/ExecutorShuffleInfo/","title":"ExecutorShuffleInfo","text":"

                                                                                ExecutorShuffleInfo is...FIXME

                                                                                "},{"location":"external-shuffle-service/ExternalBlockHandler/","title":"ExternalBlockHandler","text":"

                                                                                ExternalBlockHandler is an RpcHandler.

                                                                                "},{"location":"external-shuffle-service/ExternalBlockHandler/#creating-instance","title":"Creating Instance","text":"

                                                                                ExternalBlockHandler takes the following to be created:

                                                                                • TransportConf
                                                                                • Registered Executors File
                                                                                • ExternalBlockHandler creates the following:

                                                                                  • ShuffleMetrics
                                                                                  • OneForOneStreamManager
                                                                                  • ExternalShuffleBlockResolver

                                                                                  ExternalBlockHandler is created\u00a0when:

                                                                                  • ExternalShuffleService is requested for an ExternalBlockHandler
                                                                                  • YarnShuffleService is requested to serviceInit
                                                                                  "},{"location":"external-shuffle-service/ExternalBlockHandler/#oneforonestreammanager","title":"OneForOneStreamManager

ExternalBlockHandler can be given a OneForOneStreamManager or creates one when created.

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#externalshuffleblockresolver","title":"ExternalShuffleBlockResolver

ExternalBlockHandler can be given an ExternalShuffleBlockResolver or creates one when created.

                                                                                  ExternalShuffleBlockResolver is used for the following:

                                                                                  • registerExecutor when ExternalBlockHandler is requested to handle a RegisterExecutor message
                                                                                  • removeBlocks when ExternalBlockHandler is requested to handle a RemoveBlocks message
                                                                                  • getLocalDirs when ExternalBlockHandler is requested to handle a GetLocalDirsForExecutors message
                                                                                  • applicationRemoved when ExternalBlockHandler is requested to applicationRemoved
                                                                                  • executorRemoved when ExternalBlockHandler is requested to executorRemoved
                                                                                  • registerExecutor when ExternalBlockHandler is requested to reregisterExecutor

                                                                                  ExternalShuffleBlockResolver is used for the following:

                                                                                  • getBlockData and getRddBlockData for ManagedBufferIterator
                                                                                  • getBlockData and getContinuousBlocksData for ShuffleManagedBufferIterator

ExternalShuffleBlockResolver is closed when ExternalBlockHandler is.

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#registered-executors-file","title":"Registered Executors File

ExternalBlockHandler can be given a Java File (or null) when created.

This file is used simply to create an ExternalShuffleBlockResolver.

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#messages","title":"Messages","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#fetchshuffleblocks","title":"FetchShuffleBlocks

                                                                                  Request to read a set of blocks

                                                                                  \"Posted\" (created) when:

                                                                                  • OneForOneBlockFetcher is requested to createFetchShuffleBlocksMsg

                                                                                  When received, ExternalBlockHandler requests the OneForOneStreamManager to registerStream (with a ShuffleManagedBufferIterator).

                                                                                  ExternalBlockHandler prints out the following TRACE message to the logs:

                                                                                  Registered streamId [streamId] with [numBlockIds] buffers for client [clientId] from host [remoteAddress]\n

                                                                                  In the end, ExternalBlockHandler responds with a StreamHandle (of streamId and numBlockIds).

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#getlocaldirsforexecutors","title":"GetLocalDirsForExecutors","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#openblocks","title":"OpenBlocks

                                                                                  Note

For backward compatibility; behaves like FetchShuffleBlocks.

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#registerexecutor","title":"RegisterExecutor","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#removeblocks","title":"RemoveBlocks","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#shufflemetrics","title":"ShuffleMetrics","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#executor-removed-notification","title":"Executor Removed Notification
                                                                                  void executorRemoved(\n  String executorId,\n  String appId)\n

                                                                                  executorRemoved requests the ExternalShuffleBlockResolver to executorRemoved.

                                                                                  executorRemoved\u00a0is used when:

                                                                                  • ExternalShuffleService is requested to executorRemoved
                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#application-finished-notification","title":"Application Finished Notification
                                                                                  void applicationRemoved(\n  String appId,\n  boolean cleanupLocalDirs)\n

                                                                                  applicationRemoved requests the ExternalShuffleBlockResolver to applicationRemoved.

                                                                                  applicationRemoved\u00a0is used when:

                                                                                  • ExternalShuffleService is requested to applicationRemoved
                                                                                  • YarnShuffleService (Spark on YARN) is requested to stopApplication
                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#logging","title":"Logging

                                                                                  Enable ALL logging level for org.apache.spark.network.shuffle.ExternalBlockHandler logger to see what happens inside.

                                                                                  Add the following line to conf/log4j.properties:

                                                                                  log4j.logger.org.apache.spark.network.shuffle.ExternalBlockHandler=ALL\n

                                                                                  Refer to Logging.

                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/","title":"ExternalShuffleBlockResolver","text":"

                                                                                  ExternalShuffleBlockResolver manages converting shuffle BlockIds into physical segments of local files (from a process outside of Executors).

                                                                                  "},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#creating-instance","title":"Creating Instance","text":"

                                                                                  ExternalShuffleBlockResolver takes the following to be created:

                                                                                  • TransportConf
                                                                                  • registeredExecutor File (Java's File)
                                                                                  • Directory Cleaner
                                                                                  • ExternalShuffleBlockResolver is created\u00a0when:

                                                                                    • ExternalBlockHandler is created
                                                                                    "},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#executors","title":"Executors

                                                                                    ExternalShuffleBlockResolver uses a mapping of ExecutorShuffleInfos by AppExecId.

                                                                                    ExternalShuffleBlockResolver can (re)load this mapping from a registeredExecutor file or simply start from scratch.

                                                                                    A new mapping is added when registering an executor.

                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#directory-cleaner-executor","title":"Directory Cleaner Executor

ExternalShuffleBlockResolver can be given a Java Executor or uses a single worker-thread executor (with the spark-shuffle-directory-cleaner thread prefix).

The Executor is used to schedule a thread that cleans up an executor's local directories and the non-shuffle, non-RDD files in those directories.

                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#sparkshuffleservicefetchrddenabled","title":"spark.shuffle.service.fetch.rdd.enabled

                                                                                    ExternalShuffleBlockResolver uses spark.shuffle.service.fetch.rdd.enabled configuration property to control whether or not to remove cached RDD files (alongside shuffle output files).

                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#registering-executor","title":"Registering Executor
                                                                                    void registerExecutor(\n  String appId,\n  String execId,\n  ExecutorShuffleInfo executorInfo)\n

                                                                                    registerExecutor...FIXME

                                                                                    registerExecutor is used when:

                                                                                    • ExternalBlockHandler is requested to handle a RegisterExecutor message and reregisterExecutor
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#cleaning-up-local-directories-for-removed-executor","title":"Cleaning Up Local Directories for Removed Executor
                                                                                    void executorRemoved(\n  String executorId,\n  String appId)\n

                                                                                    executorRemoved prints out the following INFO message to the logs:

                                                                                    Clean up non-shuffle and non-RDD files associated with the finished executor [executorId]\n

                                                                                    executorRemoved looks up the executor in the executors internal registry.

                                                                                    When found, executorRemoved prints out the following INFO message to the logs and requests the Directory Cleaner Executor to execute asynchronous deletion of the executor's local directories (on a separate thread).

                                                                                    Cleaning up non-shuffle and non-RDD files in executor [AppExecId]'s [localDirs] local dirs\n

                                                                                    When not found, executorRemoved prints out the following INFO message to the logs:

                                                                                    Executor is not registered (appId=[appId], execId=[executorId])\n

                                                                                    executorRemoved\u00a0is used when:

                                                                                    • ExternalBlockHandler is requested to executorRemoved
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#deletenonshuffleserviceservedfiles","title":"deleteNonShuffleServiceServedFiles
                                                                                    void deleteNonShuffleServiceServedFiles(\n  String[] dirs)\n

                                                                                    deleteNonShuffleServiceServedFiles creates a Java FilenameFilter for files that meet all of the following:

1. The file name does not end with .index or .data
2. When rddFetchEnabled is enabled, the file name does not start with the rdd_ prefix

                                                                                    deleteNonShuffleServiceServedFiles deletes files and directories (based on the FilenameFilter) in every directory (in the input dirs).
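The filter can be sketched in Scala as follows (an illustration of the criteria above, not Spark's actual code):

import java.io.{File, FilenameFilter}\n\n// accept only files that are NOT served by the shuffle service\ndef nonShuffleServiceServedFilesFilter(rddFetchEnabled: Boolean): FilenameFilter =\n  new FilenameFilter {\n    override def accept(dir: File, name: String): Boolean = {\n      val isShuffleFile = name.endsWith(\".index\") || name.endsWith(\".data\")\n      val isCachedRddFile = rddFetchEnabled && name.startsWith(\"rdd_\")\n      !isShuffleFile && !isCachedRddFile\n    }\n  }\n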

                                                                                    deleteNonShuffleServiceServedFiles prints out the following DEBUG message to the logs:

                                                                                    Successfully cleaned up files not served by shuffle service in directory: [localDir]\n

                                                                                    In case of any exceptions, deleteNonShuffleServiceServedFiles prints out the following ERROR message to the logs:

                                                                                    Failed to delete files not served by shuffle service in directory: [localDir]\n
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#application-removed-notification","title":"Application Removed Notification
                                                                                    void applicationRemoved(\n  String appId,\n  boolean cleanupLocalDirs)\n

                                                                                    applicationRemoved...FIXME

                                                                                    applicationRemoved is used when:

                                                                                    • ExternalBlockHandler is requested to applicationRemoved
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#deleteexecutordirs","title":"deleteExecutorDirs
                                                                                    void deleteExecutorDirs(\n  String[] dirs)\n

                                                                                    deleteExecutorDirs...FIXME

                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#fetching-block-data","title":"Fetching Block Data
                                                                                    ManagedBuffer getBlockData(\n  String appId,\n  String execId,\n  int shuffleId,\n  long mapId,\n  int reduceId)\n

                                                                                    getBlockData...FIXME

                                                                                    getBlockData is used when:

                                                                                    • ManagedBufferIterator is created
                                                                                    • ShuffleManagedBufferIterator is requested for next ManagedBuffer
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#logging","title":"Logging

                                                                                    Enable ALL logging level for org.apache.spark.network.shuffle.ExternalShuffleBlockResolver logger to see what happens inside.

                                                                                    Add the following line to conf/log4j.properties:

                                                                                    log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockResolver=ALL\n

                                                                                    Refer to Logging.

                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/","title":"ExternalShuffleService","text":"

                                                                                    ExternalShuffleService is a Spark service that can serve RDD and shuffle blocks.

ExternalShuffleService manages shuffle output files so they are available to executors. As the shuffle output files are managed externally to the executors, it offers uninterrupted access to the shuffle output files regardless of executors being killed or going down (esp. with Dynamic Allocation of Executors).

ExternalShuffleService can be launched from the command line.

                                                                                    ExternalShuffleService is enabled on the driver and executors using spark.shuffle.service.enabled configuration property.

                                                                                    Note

                                                                                    Spark on YARN uses a custom external shuffle service (YarnShuffleService).

                                                                                    "},{"location":"external-shuffle-service/ExternalShuffleService/#launching-externalshuffleservice","title":"Launching ExternalShuffleService

                                                                                    ExternalShuffleService can be launched as a standalone application using spark-class.

                                                                                    spark-class org.apache.spark.deploy.ExternalShuffleService\n
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#main-entry-point","title":"main Entry Point
                                                                                    main(\n  args: Array[String]): Unit\n

                                                                                    main is the entry point of ExternalShuffleService standalone application.

                                                                                    main prints out the following INFO message to the logs:

                                                                                    Started daemon with process name: [name]\n

                                                                                    main registers signal handlers for TERM, HUP, INT signals.

                                                                                    main loads the default Spark properties.

                                                                                    main creates a SecurityManager.

main explicitly sets spark.shuffle.service.enabled to true (since this service is started from the command line for a reason).

                                                                                    main creates an ExternalShuffleService and starts it.

                                                                                    main prints out the following DEBUG message to the logs:

                                                                                    Adding shutdown hook\n

                                                                                    main registers a shutdown hook. When triggered, the shutdown hook prints the following INFO message to the logs and requests the ExternalShuffleService to stop.

                                                                                    Shutting down shuffle service.\n
                                                                                    ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#creating-instance","title":"Creating Instance

                                                                                    ExternalShuffleService takes the following to be created:

                                                                                    • SparkConf
                                                                                    • SecurityManager

                                                                                      ExternalShuffleService is created\u00a0when:

                                                                                      • ExternalShuffleService standalone application is started
                                                                                      • Worker (Spark Standalone) is created (and initializes an ExternalShuffleService)
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#transportserver","title":"TransportServer
                                                                                      server: TransportServer\n

                                                                                      ExternalShuffleService uses an internal reference to a TransportServer that is created when ExternalShuffleService is started.

                                                                                      ExternalShuffleService uses an ExternalBlockHandler to handle RPC messages (and serve RDD blocks and shuffle blocks).

                                                                                      TransportServer is requested to close when ExternalShuffleService is requested to stop.

                                                                                      TransportServer is used for metrics.

                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#port","title":"Port

                                                                                      ExternalShuffleService uses spark.shuffle.service.port configuration property for the port to listen to when started.

                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#sparkshuffleserviceenabled","title":"spark.shuffle.service.enabled

ExternalShuffleService uses the spark.shuffle.service.enabled configuration property to control whether or not it is enabled (and should be started when requested).

                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#externalblockhandler","title":"ExternalBlockHandler
                                                                                      blockHandler: ExternalBlockHandler\n

                                                                                      ExternalShuffleService creates an ExternalBlockHandler when created.

                                                                                      With spark.shuffle.service.db.enabled and spark.shuffle.service.enabled configuration properties enabled, the ExternalBlockHandler is given a local directory with a registeredExecutors.ldb file.

                                                                                      blockHandler\u00a0is used to create a TransportContext that creates the TransportServer.

                                                                                      blockHandler\u00a0is used when:

                                                                                      • applicationRemoved
                                                                                      • executorRemoved
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#findregisteredexecutorsdbfile","title":"findRegisteredExecutorsDBFile
                                                                                      findRegisteredExecutorsDBFile(\n  dbName: String): File\n

findRegisteredExecutorsDBFile returns the input dbName file in one of the local directories (defined using the spark.local.dir configuration property), or null when no directories are defined.

findRegisteredExecutorsDBFile searches the local directories (defined using the spark.local.dir configuration property) for the input dbName file. If not found, findRegisteredExecutorsDBFile uses the first local directory.

                                                                                      With no local directories defined in spark.local.dir configuration property, findRegisteredExecutorsDBFile prints out the following WARN message to the logs and returns null.

                                                                                      'spark.local.dir' should be set first when we use db in ExternalShuffleService. Note that this only affects standalone mode.\n
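The lookup can be sketched as follows (an illustration of the behaviour described above, not Spark's actual code):

import java.io.File\n\n// search the configured local dirs for dbName; fall back to the first dir\ndef findRegisteredExecutorsDBFile(localDirs: Seq[String], dbName: String): File =\n  if (localDirs.isEmpty) null\n  else localDirs\n    .map(dir => new File(dir, dbName))\n    .find(_.exists())\n    .getOrElse(new File(localDirs.head, dbName))\n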
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#starting-externalshuffleservice","title":"Starting ExternalShuffleService
                                                                                      start(): Unit\n

                                                                                      start prints out the following INFO message to the logs:

                                                                                      Starting shuffle service on port [port] (auth enabled = [authEnabled])\n

start creates an AuthServerBootstrap when authentication is enabled (based on the SecurityManager).

                                                                                      start creates a TransportContext (with the ExternalBlockHandler) and requests it to create a server (on the port).

                                                                                      start...FIXME

                                                                                      start\u00a0is used when:

                                                                                      • ExternalShuffleService is requested to startIfEnabled and is launched (as a command-line application)
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#startifenabled","title":"startIfEnabled
                                                                                      startIfEnabled(): Unit\n

                                                                                      startIfEnabled starts the external shuffle service if enabled.

                                                                                      startIfEnabled\u00a0is used when:

                                                                                      • Worker (Spark Standalone) is requested to startExternalShuffleService
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#executor-removed-notification","title":"Executor Removed Notification
                                                                                      executorRemoved(\n  executorId: String,\n  appId: String): Unit\n

                                                                                      executorRemoved requests the ExternalBlockHandler to executorRemoved.

                                                                                      executorRemoved\u00a0is used when:

                                                                                      • Worker (Spark Standalone) is requested to handleExecutorStateChanged
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#application-finished-notification","title":"Application Finished Notification
                                                                                      applicationRemoved(\n  appId: String): Unit\n

                                                                                      applicationRemoved requests the ExternalBlockHandler to applicationRemoved (with cleanupLocalDirs flag enabled).

                                                                                      applicationRemoved\u00a0is used when:

                                                                                      • Worker (Spark Standalone) is requested to handle WorkDirCleanup message and maybeCleanupApplication
                                                                                      ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#logging","title":"Logging

                                                                                      Enable ALL logging level for org.apache.spark.deploy.ExternalShuffleService logger to see what happens inside.

                                                                                      Add the following line to conf/log4j.properties:

                                                                                      log4j.logger.org.apache.spark.deploy.ExternalShuffleService=ALL\n

                                                                                      Refer to Logging.

                                                                                      ","text":""},{"location":"external-shuffle-service/configuration-properties/","title":"Spark Configuration Properties of External Shuffle Service","text":"

                                                                                      The following are configuration properties of External Shuffle Service.

                                                                                      "},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleservicedbenabled","title":"spark.shuffle.service.db.enabled

                                                                                      Whether to use db in ExternalShuffleService. Note that this only affects standalone mode.

                                                                                      Default: true

                                                                                      Used when:

                                                                                      • ExternalShuffleService is requested for an ExternalBlockHandler
                                                                                      • Worker (Spark Standalone) is requested to handle a WorkDirCleanup message
                                                                                      ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleserviceenabled","title":"spark.shuffle.service.enabled

                                                                                      Controls whether to use the External Shuffle Service

                                                                                      Default: false

                                                                                      Note

                                                                                      LocalSparkCluster turns this property off explicitly when started.

                                                                                      Used when:

                                                                                      • BlacklistTracker is requested to updateBlacklistForFetchFailure
                                                                                      • ExecutorMonitor is created
                                                                                      • ExecutorAllocationManager is requested to validateSettings
                                                                                      • SparkEnv utility is requested to create a \"base\" SparkEnv
                                                                                      • ExternalShuffleService is created and started
                                                                                      • Worker (Spark Standalone) is requested to handle a WorkDirCleanup message or started
                                                                                      • ExecutorRunnable (Spark on YARN) is requested to startContainer
                                                                                      ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleservicefetchrddenabled","title":"spark.shuffle.service.fetch.rdd.enabled

Enables ExternalShuffleService for fetching disk-persisted RDD blocks.

When enabled together with Dynamic Resource Allocation, executors that hold only disk-persisted blocks are considered idle after spark.dynamicAllocation.executorIdleTimeout and are released accordingly.

                                                                                      Default: false

                                                                                      Used when:

                                                                                      • ExternalShuffleBlockResolver is created
                                                                                      • SparkEnv utility is requested to create a \"base\" SparkEnv
                                                                                      • ExecutorMonitor is created
                                                                                      ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleserviceport","title":"spark.shuffle.service.port

                                                                                      Port of the external shuffle service

                                                                                      Default: 7337

                                                                                      Used when:

                                                                                      • ExternalShuffleService is created
                                                                                      • StorageUtils utility is requested for the port of an external shuffle service
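As a quick illustration of the properties in this section, here is a minimal sketch (not from the Spark sources; the application name and values are examples only) of a SparkConf for an application that uses an External Shuffle Service together with Dynamic Resource Allocation:

import org.apache.spark.SparkConf\n\n// Example values only; the port has to match the external shuffle service deployment\nval conf = new SparkConf()\n  .setAppName(\"external-shuffle-service-demo\") // hypothetical application name\n  .set(\"spark.shuffle.service.enabled\", \"true\")\n  .set(\"spark.shuffle.service.port\", \"7337\")\n  .set(\"spark.shuffle.service.fetch.rdd.enabled\", \"true\")\n  .set(\"spark.dynamicAllocation.enabled\", \"true\") // typically paired with the service\n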
                                                                                      ","text":""},{"location":"features/","title":"Features","text":""},{"location":"history-server/","title":"Spark History Server","text":"

                                                                                      Spark History Server is the web UI of Spark applications with event log collection enabled (based on spark.eventLog.enabled configuration property).

                                                                                      Spark History Server is an extension of Spark's web UI.

                                                                                      Spark History Server can be started using start-history-server.sh and stopped using stop-history-server.sh shell scripts.

Spark History Server supports custom configuration properties that can be defined using the --properties-file [propertiesFile] command-line option. The properties file can contain any valid spark.-prefixed Spark property.

                                                                                      $ ./sbin/start-history-server.sh --properties-file history.properties\n

                                                                                      If not specified explicitly, Spark History Server uses the default configuration file, i.e. spark-defaults.conf.
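For example, a history.properties file could look as follows (the file name, directory and port are illustrative only; both properties are regular spark.history.* properties):

spark.history.fs.logDirectory=file:/tmp/spark-events\nspark.history.ui.port=18080\n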

                                                                                      Spark History Server can replay events from event log files recorded by EventLoggingListener.

                                                                                      "},{"location":"history-server/#start-history-serversh-shell-script","title":"start-history-server.sh Shell Script

                                                                                      $SPARK_HOME/sbin/start-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to start a Spark History Server instance.

                                                                                      $ ./sbin/start-history-server.sh\nstarting org.apache.spark.deploy.history.HistoryServer, logging to .../spark/logs/spark-jacek-org.apache.spark.deploy.history.HistoryServer-1-japila.out\n

                                                                                      Internally, start-history-server.sh script starts org.apache.spark.deploy.history.HistoryServer standalone application (using spark-daemon.sh shell script).

                                                                                      $ ./bin/spark-class org.apache.spark.deploy.history.HistoryServer\n

                                                                                      Tip

Using the more explicit spark-class approach to start Spark History Server makes it easier to trace its execution, since the logs are printed out to the standard output and hence directly to the terminal.

                                                                                      When started, start-history-server.sh prints out the following INFO message to the logs:

                                                                                      Started daemon with process name: [processName]\n

                                                                                      start-history-server.sh registers signal handlers (using SignalUtils) for TERM, HUP, INT to log their execution:

                                                                                      RECEIVED SIGNAL [signal]\n

start-history-server.sh initializes security if enabled (based on spark.history.kerberos.enabled configuration property).

                                                                                      start-history-server.sh creates a SecurityManager.

start-history-server.sh creates an ApplicationHistoryProvider (based on spark.history.provider configuration property).

                                                                                      In the end, start-history-server.sh creates a HistoryServer and requests it to bind to the port (based on spark.history.ui.port configuration property).

                                                                                      Note

                                                                                      The host's IP can be specified using SPARK_LOCAL_IP environment variable (defaults to 0.0.0.0).

                                                                                      start-history-server.sh prints out the following INFO message to the logs:

                                                                                      Bound HistoryServer to [host], and started at [webUrl]\n

                                                                                      start-history-server.sh registers a shutdown hook to call stop on the HistoryServer instance.

                                                                                      ","text":""},{"location":"history-server/#stop-history-serversh-shell-script","title":"stop-history-server.sh Shell Script

                                                                                      $SPARK_HOME/sbin/stop-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to stop a running instance of Spark History Server.

                                                                                      $ ./sbin/stop-history-server.sh\nstopping org.apache.spark.deploy.history.HistoryServer\n
                                                                                      ","text":""},{"location":"history-server/ApplicationCache/","title":"ApplicationCache","text":"

ApplicationCache is a cache of SparkUIs of Spark applications (used by Spark History Server).

ApplicationCache is created exclusively when HistoryServer is created.

ApplicationCache uses the Google Guava 14.0.1 library (https://github.com/google/guava/wiki/Release14) for the internal appCache registry.

ApplicationCache uses the following internal properties (registries, counters and flags):

• appLoader: Google Guava's CacheLoader (https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html) with a custom load (https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html#load(K)) which is simply the load method described below. Used when...FIXME

• removalListener

• appCache: Google Guava's LoadingCache (https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/LoadingCache.html) of CacheKey keys and CacheEntry entries. Used when ApplicationCache is requested to get a CacheEntry (given appId and attemptId IDs) and...FIXME (other uses)

• metrics

Creating ApplicationCache Instance

ApplicationCache takes the following to be created:

• ApplicationCacheOperations
• retainedApplications
• Clock

ApplicationCache initializes the internal properties.
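The following is a standalone sketch (not the actual ApplicationCache code) of how a Guava CacheLoader, a RemovalListener and a LoadingCache fit together; CacheKey and CacheEntry below are simplified stand-ins for the real classes:

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, RemovalListener, RemovalNotification}\n\n// Simplified stand-ins for the real CacheKey and CacheEntry\ncase class CacheKey(appId: String, attemptId: Option[String])\ncase class CacheEntry(description: String)\n\n// appLoader: loads an entry on a cache miss\nval appLoader = new CacheLoader[CacheKey, CacheEntry] {\n  override def load(key: CacheKey): CacheEntry = CacheEntry(s\"UI of ${key.appId}\")\n}\n\n// removalListener: notified when an entry is evicted\nval removalListener = new RemovalListener[CacheKey, CacheEntry] {\n  override def onRemoval(n: RemovalNotification[CacheKey, CacheEntry]): Unit =\n    println(s\"Evicted ${n.getKey}\")\n}\n\n// appCache: bounded by retainedApplications (50 here, as an example)\nval appCache: LoadingCache[CacheKey, CacheEntry] = CacheBuilder.newBuilder()\n  .maximumSize(50)\n  .removalListener(removalListener)\n  .build(appLoader)\n\nappCache.get(CacheKey(\"app-20240217-0001\", None)) // loads and caches the entry\n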

loadApplicationEntry Internal Method

                                                                                        "},{"location":"history-server/ApplicationCache/#source-scala","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#loadapplicationentryappid-string-attemptid-optionstring-cacheentry","title":"loadApplicationEntry(appId: String, attemptId: Option[String]): CacheEntry","text":"

                                                                                        loadApplicationEntry...FIXME

NOTE: loadApplicationEntry is used exclusively when ApplicationCache is requested to load a cached Spark application UI.

Loading Cached Spark Application UI: load Method

                                                                                        "},{"location":"history-server/ApplicationCache/#source-scala_1","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#loadkey-cachekey-cacheentry","title":"load(key: CacheKey): CacheEntry","text":"

NOTE: load is part of Google Guava's CacheLoader (https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html) contract to retrieve a CacheEntry (based on a CacheKey) for the appCache.

load simply relays to loadApplicationEntry with the appId and attemptId of the input CacheKey.

Requesting Cached UI of Spark Application (CacheEntry): get Method

                                                                                        "},{"location":"history-server/ApplicationCache/#source-scala_2","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#getappid-string-attemptid-optionstring-none-cacheentry","title":"get(appId: String, attemptId: Option[String] = None): CacheEntry","text":"

                                                                                        get...FIXME

NOTE: get is used when ApplicationCache is requested to withSparkUI.

Executing Closure While Holding Application's UI Read Lock: withSparkUI Method

                                                                                        "},{"location":"history-server/ApplicationCache/#source-scala_3","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#withsparkuitfn-sparkui-t-t","title":"withSparkUIT(fn: SparkUI => T): T","text":"

                                                                                        withSparkUI...FIXME

NOTE: withSparkUI is used when HistoryServer is requested to withSparkUI and loadAppUi.

                                                                                        "},{"location":"history-server/ApplicationCacheOperations/","title":"ApplicationCacheOperations","text":"

ApplicationCacheOperations is the contract of the cache operations (getting, attaching and detaching SparkUIs of Spark applications) that ApplicationCache uses.

package org.apache.spark.deploy.history\n\ntrait ApplicationCacheOperations {\n  // only required methods that have no implementation\n  // (the others follow)\n  def getAppUI(appId: String, attemptId: Option[String]): Option[LoadedAppUI]\n  def attachSparkUI(\n    appId: String,\n    attemptId: Option[String],\n    ui: SparkUI,\n    completed: Boolean): Unit\n  def detachSparkUI(appId: String, attemptId: Option[String], ui: SparkUI): Unit\n}\n

                                                                                        NOTE: ApplicationCacheOperations is a private[history] contract.

(Subset of) ApplicationCacheOperations Contract:

• getAppUI: SparkUI (the web UI of a Spark application). Used exclusively when ApplicationCache is requested to loadApplicationEntry

• attachSparkUI

• detachSparkUI

NOTE: HistoryServer is the one and only known implementation of ApplicationCacheOperations in Apache Spark."},{"location":"history-server/ApplicationHistoryProvider/","title":"ApplicationHistoryProvider","text":"

                                                                                        ApplicationHistoryProvider is an abstraction of history providers.

                                                                                        "},{"location":"history-server/ApplicationHistoryProvider/#contract","title":"Contract","text":""},{"location":"history-server/ApplicationHistoryProvider/#getapplicationinfo","title":"getApplicationInfo
                                                                                        getApplicationInfo(\n  appId: String): Option[ApplicationInfo]\n

                                                                                        Used when...FIXME

                                                                                        ","text":""},{"location":"history-server/ApplicationHistoryProvider/#getappui","title":"getAppUI
                                                                                        getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

                                                                                        SparkUI for a given application (by appId)

                                                                                        Used when HistoryServer is requested for the UI of a Spark application

                                                                                        ","text":""},{"location":"history-server/ApplicationHistoryProvider/#getlisting","title":"getListing
                                                                                        getListing(): Iterator[ApplicationInfo]\n

                                                                                        Used when...FIXME

                                                                                        ","text":""},{"location":"history-server/ApplicationHistoryProvider/#onuidetached","title":"onUIDetached
                                                                                        onUIDetached(\n  appId: String,\n  attemptId: Option[String],\n  ui: SparkUI): Unit\n

                                                                                        Used when...FIXME

                                                                                        ","text":""},{"location":"history-server/ApplicationHistoryProvider/#writeeventlogs","title":"writeEventLogs
                                                                                        writeEventLogs(\n  appId: String,\n  attemptId: Option[String],\n  zipStream: ZipOutputStream): Unit\n

                                                                                        Writes events to a stream

                                                                                        Used when...FIXME

                                                                                        ","text":""},{"location":"history-server/ApplicationHistoryProvider/#implementations","title":"Implementations","text":"
                                                                                        • FsHistoryProvider
                                                                                        "},{"location":"history-server/EventLogFileWriter/","title":"EventLogFileWriter","text":"

                                                                                        EventLogFileWriter is...FIXME

                                                                                        "},{"location":"history-server/EventLoggingListener/","title":"EventLoggingListener","text":"

                                                                                        EventLoggingListener is a SparkListener that writes out JSON-encoded events of a Spark application with event logging enabled (based on spark.eventLog.enabled configuration property).

                                                                                        EventLoggingListener supports custom configuration properties.

                                                                                        EventLoggingListener writes out log files to a directory (based on spark.eventLog.dir configuration property).
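As a sketch (values are examples only), event logging can be enabled programmatically on a SparkConf so that the resulting event log files can later be replayed by Spark History Server:

import org.apache.spark.{SparkConf, SparkContext}\n\n// Example values only; the directory must exist and be readable by the History Server\nval conf = new SparkConf()\n  .setMaster(\"local[*]\")\n  .setAppName(\"event-logging-demo\") // hypothetical application name\n  .set(\"spark.eventLog.enabled\", \"true\")\n  .set(\"spark.eventLog.dir\", \"file:/tmp/spark-events\")\nval sc = new SparkContext(conf) // creates an EventLoggingListener under the covers\nsc.parallelize(1 to 5).count()  // generates events that are written out to the event log\nsc.stop()\n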

                                                                                        "},{"location":"history-server/EventLoggingListener/#creating-instance","title":"Creating Instance","text":"

                                                                                        EventLoggingListener takes the following to be created:

                                                                                        • Application ID
                                                                                        • Application Attempt ID
                                                                                        • Log Directory
                                                                                        • SparkConf
                                                                                        • Hadoop Configuration

                                                                                          EventLoggingListener is created\u00a0when SparkContext is created (with spark.eventLog.enabled enabled).

                                                                                          "},{"location":"history-server/EventLoggingListener/#eventlogfilewriter","title":"EventLogFileWriter
                                                                                          logWriter: EventLogFileWriter\n

EventLoggingListener creates an EventLogFileWriter when created.

                                                                                          Note

                                                                                          All arguments to create an EventLoggingListener are passed to the EventLogFileWriter.

                                                                                          The EventLogFileWriter is started when EventLoggingListener is started.

                                                                                          The EventLogFileWriter is stopped when EventLoggingListener is stopped.

                                                                                          The EventLogFileWriter is requested to writeEvent when EventLoggingListener is requested to start and log an event.

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#starting-eventlogginglistener","title":"Starting EventLoggingListener
                                                                                          start(): Unit\n

start requests the EventLogFileWriter to start, followed by initEventLog.

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#initeventlog","title":"initEventLog
                                                                                          initEventLog(): Unit\n

                                                                                          initEventLog...FIXME

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#logging-event","title":"Logging Event
                                                                                          logEvent(\n  event: SparkListenerEvent,\n  flushLogger: Boolean = false): Unit\n

                                                                                          logEvent persists the given SparkListenerEvent in JSON format.

                                                                                          logEvent converts the event to JSON format and requests the EventLogFileWriter to write it out.

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#stopping-eventlogginglistener","title":"Stopping EventLoggingListener
                                                                                          stop(): Unit\n

                                                                                          stop requests the EventLogFileWriter to stop.

                                                                                          stop is used when SparkContext is requested to stop.

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#inprogress-file-extension","title":"inprogress File Extension

                                                                                          EventLoggingListener uses .inprogress file extension for in-flight event log files of active Spark applications.

                                                                                          ","text":""},{"location":"history-server/EventLoggingListener/#logging","title":"Logging

                                                                                          Enable ALL logging level for org.apache.spark.scheduler.EventLoggingListener logger to see what happens inside.

                                                                                          Add the following line to conf/log4j.properties:

                                                                                          log4j.logger.org.apache.spark.scheduler.EventLoggingListener=ALL\n

                                                                                          Refer to Logging.

                                                                                          ","text":""},{"location":"history-server/FsHistoryProvider/","title":"FsHistoryProvider","text":"

                                                                                          FsHistoryProvider is the default ApplicationHistoryProvider for Spark History Server.

                                                                                          "},{"location":"history-server/FsHistoryProvider/#creating-instance","title":"Creating Instance","text":"

                                                                                          FsHistoryProvider takes the following to be created:

                                                                                          • SparkConf
                                                                                          • Clock (default: SystemClock)

                                                                                            FsHistoryProvider is created\u00a0when HistoryServer standalone application is started (and no spark.history.provider configuration property was defined).

                                                                                            "},{"location":"history-server/FsHistoryProvider/#path-of-application-history-cache","title":"Path of Application History Cache
                                                                                            storePath: Option[File]\n

                                                                                            FsHistoryProvider uses spark.history.store.path configuration property for the directory to cache application history.

With storePath defined, FsHistoryProvider uses LevelDB as the KVStore. Otherwise, an InMemoryStore.

                                                                                            With storePath defined, FsHistoryProvider uses a HistoryServerDiskManager as the disk manager.
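For example (the path is illustrative only), setting the following property in the History Server properties file defines storePath and so enables the LevelDB-backed store and the disk manager described above:

spark.history.store.path=/var/spark/history-cache\n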

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#disk-manager","title":"Disk Manager
                                                                                            diskManager: Option[HistoryServerDiskManager]\n

                                                                                            FsHistoryProvider creates a HistoryServerDiskManager when created (with storePath defined based on spark.history.store.path configuration property).

                                                                                            FsHistoryProvider uses the HistoryServerDiskManager for the following:

                                                                                            • startPolling
                                                                                            • getAppUI
                                                                                            • onUIDetached
                                                                                            • cleanAppData
                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#sparkui-of-spark-application","title":"SparkUI of Spark Application
                                                                                            getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

                                                                                            getAppUI is part of the ApplicationHistoryProvider abstraction.

                                                                                            getAppUI...FIXME

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#onuidetached","title":"onUIDetached
                                                                                            onUIDetached(): Unit\n

                                                                                            onUIDetached is part of the ApplicationHistoryProvider abstraction.

                                                                                            onUIDetached...FIXME

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#loaddiskstore","title":"loadDiskStore
                                                                                            loadDiskStore(\n  dm: HistoryServerDiskManager,\n  appId: String,\n  attempt: AttemptInfoWrapper): KVStore\n

                                                                                            loadDiskStore...FIXME

                                                                                            loadDiskStore is used in getAppUI (with HistoryServerDiskManager available).

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#createinmemorystore","title":"createInMemoryStore
                                                                                            createInMemoryStore(\n  attempt: AttemptInfoWrapper): KVStore\n

                                                                                            createInMemoryStore...FIXME

                                                                                            createInMemoryStore is used in getAppUI.

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#rebuildappstore","title":"rebuildAppStore
                                                                                            rebuildAppStore(\n  store: KVStore,\n  reader: EventLogFileReader,\n  lastUpdated: Long): Unit\n

                                                                                            rebuildAppStore...FIXME

                                                                                            rebuildAppStore is used in loadDiskStore and createInMemoryStore.

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#cleanappdata","title":"cleanAppData
                                                                                            cleanAppData(\n  appId: String,\n  attemptId: Option[String],\n  logPath: String): Unit\n

                                                                                            cleanAppData...FIXME

                                                                                            cleanAppData is used in checkForLogs and deleteAttemptLogs.

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#polling-for-logs","title":"Polling for Logs
                                                                                            startPolling(): Unit\n

                                                                                            startPolling...FIXME

                                                                                            startPolling is used in initialize and startSafeModeCheckThread.

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#checking-available-event-logs","title":"Checking Available Event Logs
                                                                                            checkForLogs(): Unit\n

                                                                                            checkForLogs...FIXME

                                                                                            ","text":""},{"location":"history-server/FsHistoryProvider/#logging","title":"Logging

                                                                                            Enable ALL logging level for org.apache.spark.deploy.history.FsHistoryProvider logger to see what happens inside.

                                                                                            Add the following line to conf/log4j.properties:

                                                                                            log4j.logger.org.apache.spark.deploy.history.FsHistoryProvider=ALL\n

                                                                                            Refer to Logging.

                                                                                            ","text":""},{"location":"history-server/HistoryAppStatusStore/","title":"HistoryAppStatusStore","text":"

                                                                                            HistoryAppStatusStore is an AppStatusStore for SparkUIs in Spark History Server.

                                                                                            "},{"location":"history-server/HistoryAppStatusStore/#creating-instance","title":"Creating Instance","text":"

                                                                                            HistoryAppStatusStore takes the following to be created:

                                                                                            • SparkConf
                                                                                            • KVStore

                                                                                              HistoryAppStatusStore is created\u00a0when:

                                                                                              • FsHistoryProvider is requested for a SparkUI (of a Spark application)
                                                                                              "},{"location":"history-server/HistoryAppStatusStore/#executorlogurlhandler","title":"ExecutorLogUrlHandler
                                                                                              logUrlHandler: ExecutorLogUrlHandler\n

                                                                                              HistoryAppStatusStore creates an ExecutorLogUrlHandler (for the logUrlPattern) when created.

                                                                                              HistoryAppStatusStore uses it when requested to replaceLogUrls.

                                                                                              ","text":""},{"location":"history-server/HistoryAppStatusStore/#executorlist","title":"executorList
                                                                                              executorList(\n  exec: v1.ExecutorSummary,\n  urlPattern: String): v1.ExecutorSummary\n

                                                                                              executorList...FIXME

                                                                                              executorList\u00a0is part of the AppStatusStore abstraction.

                                                                                              ","text":""},{"location":"history-server/HistoryAppStatusStore/#executorsummary","title":"executorSummary
                                                                                              executorSummary(\n  executorId: String): v1.ExecutorSummary\n

                                                                                              executorSummary...FIXME

                                                                                              executorSummary\u00a0is part of the AppStatusStore abstraction.

                                                                                              ","text":""},{"location":"history-server/HistoryAppStatusStore/#replacelogurls","title":"replaceLogUrls
                                                                                              replaceLogUrls(\n  exec: v1.ExecutorSummary,\n  urlPattern: String): v1.ExecutorSummary\n

                                                                                              replaceLogUrls...FIXME

                                                                                              replaceLogUrls\u00a0is used when HistoryAppStatusStore is requested to executorList and executorSummary.

                                                                                              ","text":""},{"location":"history-server/HistoryServer/","title":"HistoryServer","text":"

                                                                                              HistoryServer is an extension of the web UI for reviewing event logs of running (active) and completed Spark applications with event log collection enabled (based on spark.eventLog.enabled configuration property).

                                                                                              "},{"location":"history-server/HistoryServer/#starting-historyserver-standalone-application","title":"Starting HistoryServer Standalone Application
                                                                                              main(\n  argStrings: Array[String]): Unit\n

                                                                                              main creates a HistoryServerArguments (with the given argStrings arguments).

                                                                                              main initializes security.

                                                                                              main creates an ApplicationHistoryProvider (based on spark.history.provider configuration property).

                                                                                              main creates a HistoryServer (with the ApplicationHistoryProvider and spark.history.ui.port configuration property) and requests it to bind.

                                                                                              main requests the ApplicationHistoryProvider to start.

main registers a shutdown hook that requests the HistoryServer to stop, and then sleeps forever (giving the daemon threads a go).

                                                                                              ","text":""},{"location":"history-server/HistoryServer/#creating-instance","title":"Creating Instance

                                                                                              HistoryServer takes the following to be created:

                                                                                              • SparkConf
                                                                                              • ApplicationHistoryProvider
                                                                                              • SecurityManager
                                                                                              • Port number

                                                                                                When created, HistoryServer initializes itself.

                                                                                                HistoryServer is created\u00a0when HistoryServer standalone application is started.

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#applicationcacheoperations","title":"ApplicationCacheOperations

HistoryServer is an ApplicationCacheOperations.

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#uiroot","title":"UIRoot

                                                                                                HistoryServer is a UIRoot.

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#initializing-historyserver","title":"Initializing HistoryServer
                                                                                                initialize(): Unit\n

                                                                                                initialize is part of the WebUI abstraction.

                                                                                                initialize...FIXME

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#attaching-sparkui","title":"Attaching SparkUI
                                                                                                attachSparkUI(\n  appId: String,\n  attemptId: Option[String],\n  ui: SparkUI,\n  completed: Boolean): Unit\n

                                                                                                attachSparkUI is part of the ApplicationCacheOperations abstraction.

                                                                                                attachSparkUI...FIXME

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#spark-ui","title":"Spark UI
                                                                                                getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

                                                                                                getAppUI is part of the ApplicationCacheOperations abstraction.

                                                                                                getAppUI requests the ApplicationHistoryProvider for the Spark UI of a Spark application (based on the appId and attemptId).

                                                                                                ","text":""},{"location":"history-server/HistoryServer/#logging","title":"Logging

                                                                                                Enable ALL logging level for org.apache.spark.deploy.history.HistoryServer logger to see what happens inside.

                                                                                                Add the following line to conf/log4j.properties:

                                                                                                log4j.logger.org.apache.spark.deploy.history.HistoryServer=ALL\n

                                                                                                Refer to Logging.

                                                                                                ","text":""},{"location":"history-server/HistoryServerArguments/","title":"HistoryServerArguments","text":"

HistoryServerArguments is the command-line parser for the History Server.

When HistoryServerArguments is executed with a single command-line parameter, it is assumed to be the event logs directory.

                                                                                                $ ./sbin/start-history-server.sh /tmp/spark-events\n

This is, however, deprecated since Spark 1.1.0, and you should see the following WARN message in the logs:

                                                                                                WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.\n

                                                                                                The same WARN message shows up for --dir and -d command-line options.

The --properties-file [propertiesFile] command-line option specifies the file with custom Spark properties.

NOTE: When not specified explicitly, History Server uses the default configuration file, i.e. spark-defaults.conf.
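For example, instead of passing the directory on the command line, the property can be put in a properties file (the file name and directory are examples only):

$ echo \"spark.history.fs.logDirectory=file:/tmp/spark-events\" > history.properties\n$ ./sbin/start-history-server.sh --properties-file history.properties\n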

                                                                                                "},{"location":"history-server/HistoryServerArguments/#tip","title":"[TIP]","text":"

                                                                                                Enable WARN logging level for org.apache.spark.deploy.history.HistoryServerArguments logger to see what happens inside.

                                                                                                Add the following line to conf/log4j.properties:

                                                                                                log4j.logger.org.apache.spark.deploy.history.HistoryServerArguments=WARN\n
                                                                                                "},{"location":"history-server/HistoryServerArguments/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":""},{"location":"history-server/HistoryServerDiskManager/","title":"HistoryServerDiskManager","text":"

                                                                                                HistoryServerDiskManager is a disk manager for FsHistoryProvider.

                                                                                                "},{"location":"history-server/HistoryServerDiskManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                HistoryServerDiskManager takes the following to be created:

                                                                                                • SparkConf
                                                                                                • Path
                                                                                                • KVStore
                                                                                                • Clock

                                                                                                  HistoryServerDiskManager is created\u00a0when:

                                                                                                  • FsHistoryProvider is created (and initializes a diskManager)
                                                                                                  "},{"location":"history-server/HistoryServerDiskManager/#initializing","title":"Initializing
                                                                                                  initialize(): Unit\n

                                                                                                  initialize...FIXME

                                                                                                  initialize\u00a0is used when:

                                                                                                  • FsHistoryProvider is requested to startPolling
                                                                                                  ","text":""},{"location":"history-server/HistoryServerDiskManager/#releasing-application-store","title":"Releasing Application Store
                                                                                                  release(\n  appId: String,\n  attemptId: Option[String],\n  delete: Boolean = false): Unit\n

                                                                                                  release...FIXME

                                                                                                  release\u00a0is used when:

                                                                                                  • FsHistoryProvider is requested to onUIDetached, cleanAppData and loadDiskStore
                                                                                                  ","text":""},{"location":"history-server/JsonProtocol/","title":"JsonProtocol Utility","text":"

JsonProtocol is a utility to convert SparkListenerEvents to and from JSON format.

                                                                                                  "},{"location":"history-server/JsonProtocol/#objectmapper","title":"ObjectMapper

JsonProtocol uses a Jackson Databind ObjectMapper for performing conversions to and from JSON.

                                                                                                  ","text":""},{"location":"history-server/JsonProtocol/#converting-spark-event-to-json","title":"Converting Spark Event to JSON
                                                                                                  sparkEventToJson(\n  event: SparkListenerEvent): JValue\n

                                                                                                  sparkEventToJson converts the given SparkListenerEvent to JSON format.

                                                                                                  sparkEventToJson\u00a0is used when...FIXME

                                                                                                  ","text":""},{"location":"history-server/JsonProtocol/#converting-json-to-spark-event","title":"Converting JSON to Spark Event
                                                                                                  sparkEventFromJson(\n  json: JValue): SparkListenerEvent\n

                                                                                                  sparkEventFromJson converts a JSON-encoded event to a SparkListenerEvent.

                                                                                                  sparkEventFromJson\u00a0is used when...FIXME
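The following is a minimal sketch of a round trip through the two methods above. JsonProtocol is private[spark], so the sketch assumes it is compiled in a package under org.apache.spark (e.g. as a small test); the event and package name are examples only:

package org.apache.spark.demo // hypothetical package (must be under org.apache.spark)\n\nimport org.apache.spark.scheduler.SparkListenerApplicationEnd\nimport org.apache.spark.util.JsonProtocol\nimport org.json4s.jackson.JsonMethods.{compact, render}\n\nobject JsonProtocolRoundTrip extends App {\n  val event = SparkListenerApplicationEnd(System.currentTimeMillis())\n  val json = JsonProtocol.sparkEventToJson(event) // JValue\n  println(compact(render(json)))                  // one line of JSON text\n  val restored = JsonProtocol.sparkEventFromJson(json)\n  assert(restored == event)\n}\n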

                                                                                                  ","text":""},{"location":"history-server/ReplayListenerBus/","title":"ReplayListenerBus","text":"

                                                                                                  ReplayListenerBus is a SparkListenerBus that can replay JSON-encoded SparkListenerEvent events.

                                                                                                  ReplayListenerBus is used by FsHistoryProvider.

                                                                                                  "},{"location":"history-server/ReplayListenerBus/#replaying-json-encoded-sparklistenerevents","title":"Replaying JSON-encoded SparkListenerEvents
                                                                                                  replay(\n  logData: InputStream,\n  sourceName: String,\n  maybeTruncated: Boolean = false): Unit\n

replay reads JSON-encoded SparkListenerEvent events from logData (one event per line) and posts them to all registered SparkListenerInterfaces.

replay uses JsonProtocol (sparkEventFromJson) to convert JSON-encoded events to SparkListenerEvent objects.

NOTE: replay uses jackson from the json4s (http://json4s.org/) library to parse JSON into an AST.

                                                                                                  When there is an exception parsing a JSON event, you may see the following WARN message in the logs (for the last line) or a JsonParseException.

                                                                                                  WARN Got JsonParseException from log file $sourceName at line [lineNumber], the file might not have finished writing cleanly.\n

                                                                                                  Any other non-IO exceptions end up with the following ERROR messages in the logs:

                                                                                                  ERROR Exception parsing Spark event log: [sourceName]\nERROR Malformed line #[lineNumber]: [currentLine]\n

                                                                                                  NOTE: The sourceName input argument is only used for messages.
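For illustration, a simplified sketch of such a replay loop (not the actual implementation; the postToAll callback stands in for the listener bus, and JsonProtocol is private[spark]):

import scala.io.{Codec, Source}
import org.json4s.jackson.JsonMethods.parse
import com.fasterxml.jackson.core.JsonParseException
import org.apache.spark.scheduler.SparkListenerEvent
import org.apache.spark.util.JsonProtocol

def replaySketch(
    logData: java.io.InputStream,
    sourceName: String,
    postToAll: SparkListenerEvent => Unit): Unit = {
  // one JSON-encoded event per line
  for ((line, index) <- Source.fromInputStream(logData)(Codec.UTF8).getLines().zipWithIndex) {
    try postToAll(JsonProtocol.sparkEventFromJson(parse(line)))
    catch {
      case _: JsonParseException =>
        println(s"WARN Got JsonParseException from log file $sourceName at line ${index + 1}, " +
          "the file might not have finished writing cleanly.")
    }
  }
}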

                                                                                                  ","text":""},{"location":"history-server/SQLHistoryListener/","title":"SQLHistoryListener","text":"

                                                                                                  == SQLHistoryListener

SQLHistoryListener is a custom SQLListener for History Server. It attaches the SQL tab to History Server's web UI only when the first SparkListenerSQLExecutionStart event arrives and then shuts <> off. It also handles <>.

                                                                                                  NOTE: Support for SQL UI in History Server was added in SPARK-11206 Support SQL UI on the history server.

                                                                                                  CAUTION: FIXME Add the link to the JIRA.

                                                                                                  === [[onOtherEvent]] onOtherEvent

                                                                                                  "},{"location":"history-server/SQLHistoryListener/#source-scala","title":"[source, scala]","text":""},{"location":"history-server/SQLHistoryListener/#onothereventevent-sparklistenerevent-unit","title":"onOtherEvent(event: SparkListenerEvent): Unit","text":"

When a SparkListenerSQLExecutionStart event arrives, onOtherEvent attaches the SQL tab to the web UI and passes the call on to the parent SQLListener.

                                                                                                  === [[onTaskEnd]] onTaskEnd

                                                                                                  CAUTION: FIXME

                                                                                                  === [[creating-instance]] Creating SQLHistoryListener Instance

SQLHistoryListener is created using a (private[sql]) SQLHistoryListenerFactory class (which is a SparkHistoryListenerFactory).

The SQLHistoryListenerFactory class is registered as a Java service in META-INF/services/org.apache.spark.scheduler.SparkHistoryListenerFactory when SparkUI creates a web UI for History Server:

                                                                                                  org.apache.spark.sql.execution.ui.SQLHistoryListenerFactory\n

NOTE: Loading the service uses Java's ServiceLoader.load method.

                                                                                                  === [[onExecutorMetricsUpdate]] onExecutorMetricsUpdate

                                                                                                  onExecutorMetricsUpdate does nothing.

                                                                                                  "},{"location":"history-server/configuration-properties/","title":"Configuration Properties","text":"

The following are the configuration properties of EventLoggingListener and HistoryServer.

                                                                                                  "},{"location":"history-server/configuration-properties/#sparkeventlog","title":"spark.eventLog","text":""},{"location":"history-server/configuration-properties/#bufferkb","title":"buffer.kb

                                                                                                  spark.eventLog.buffer.kb

                                                                                                  Size of the buffer to use when writing to output streams

                                                                                                  Default: 100k

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#compress","title":"compress

                                                                                                  spark.eventLog.compress

                                                                                                  Enables event compression (using a CompressionCodec)

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#compressioncodec","title":"compression.codec

                                                                                                  spark.eventLog.compression.codec

                                                                                                  The codec used to compress event log (with spark.eventLog.compress enabled). By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also use fully qualified class names to specify the codec.

                                                                                                  Default: zstd

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#dir","title":"dir

                                                                                                  spark.eventLog.dir

                                                                                                  Directory where Spark events are logged to (e.g. hdfs://namenode:8021/directory)

                                                                                                  Default: /tmp/spark-events

                                                                                                  The directory must exist before SparkContext can be created

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#enabled","title":"enabled

                                                                                                  spark.eventLog.enabled

                                                                                                  Enables persisting Spark events

                                                                                                  Default: false
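A quick sketch of setting the event-log properties together on a SparkConf (the directory is the example value shown above for spark.eventLog.dir):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8021/directory") // must exist before SparkContext is created
  .set("spark.eventLog.compress", "true")                      // optional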

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#erasurecodingenabled","title":"erasureCoding.enabled

                                                                                                  spark.eventLog.erasureCoding.enabled

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#gcmetricsyounggenerationgarbagecollectors","title":"gcMetrics.youngGenerationGarbageCollectors

                                                                                                  spark.eventLog.gcMetrics.youngGenerationGarbageCollectors

                                                                                                  Names of supported young generation garbage collectors. A name usually is the output of GarbageCollectorMXBean.getName.

                                                                                                  Default: Copy, PS Scavenge, ParNew, G1 Young Generation (the built-in young generation garbage collectors)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#gcmetricsoldgenerationgarbagecollectors","title":"gcMetrics.oldGenerationGarbageCollectors

                                                                                                  spark.eventLog.gcMetrics.oldGenerationGarbageCollectors

                                                                                                  Names of supported old generation garbage collectors. A name usually is the output of GarbageCollectorMXBean.getName.

                                                                                                  Default: MarkSweepCompact, PS MarkSweep, ConcurrentMarkSweep, G1 Old Generation (the built-in old generation garbage collectors)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#logblockupdatesenabled","title":"logBlockUpdates.enabled

                                                                                                  spark.eventLog.logBlockUpdates.enabled

Enables logging RDD block updates (using EventLoggingListener)

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#logstageexecutormetrics","title":"logStageExecutorMetrics

                                                                                                  spark.eventLog.logStageExecutorMetrics

                                                                                                  Enables logging of per-stage peaks of executor metrics (for each executor) to the event log

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#longformenabled","title":"longForm.enabled

                                                                                                  spark.eventLog.longForm.enabled

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#overwrite","title":"overwrite

                                                                                                  spark.eventLog.overwrite

Enables deleting (or at least overwriting) existing .inprogress event log files

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#rollingenabled","title":"rolling.enabled

                                                                                                  spark.eventLog.rolling.enabled

Enables rolling over event log files. When enabled, each event log file is cut down to spark.eventLog.rolling.maxFileSize

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#rollingmaxfilesize","title":"rolling.maxFileSize

                                                                                                  spark.eventLog.rolling.maxFileSize

                                                                                                  Max size of event log file to be rolled over (with spark.eventLog.rolling.enabled enabled)

                                                                                                  Default: 128m

                                                                                                  Must be at least 10 MiB
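A hedged sketch of enabling rolling event logs (the size below is only an example; any value of at least 10 MiB works):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.rolling.enabled", "true")
  .set("spark.eventLog.rolling.maxFileSize", "256m")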

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#sparkhistory","title":"spark.history","text":""},{"location":"history-server/configuration-properties/#fslogdirectory","title":"fs.logDirectory

                                                                                                  spark.history.fs.logDirectory

                                                                                                  The directory for event log files. The directory has to exist before starting History Server.

                                                                                                  Default: file:/tmp/spark-events

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#kerberosenabled","title":"kerberos.enabled

                                                                                                  spark.history.kerberos.enabled

Whether to enable (true) or disable (false) security when working with HDFS with Kerberos security enabled.

                                                                                                  Default: false

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#kerberoskeytab","title":"kerberos.keytab

                                                                                                  spark.history.kerberos.keytab

                                                                                                  Keytab to use for login to Kerberos. Required when spark.history.kerberos.enabled is enabled.

                                                                                                  Default: (empty)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#kerberosprincipal","title":"kerberos.principal

                                                                                                  spark.history.kerberos.principal

                                                                                                  Kerberos principal. Required when spark.history.kerberos.enabled is enabled.

                                                                                                  Default: (empty)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#provider","title":"provider

                                                                                                  spark.history.provider

                                                                                                  Fully-qualified class name of an ApplicationHistoryProvider for HistoryServer.

                                                                                                  Default: org.apache.spark.deploy.history.FsHistoryProvider

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#retainedapplications","title":"retainedApplications

                                                                                                  spark.history.retainedApplications

                                                                                                  How many Spark applications HistoryServer should retain

                                                                                                  Default: 50

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#storepath","title":"store.path

                                                                                                  spark.history.store.path

Local directory to cache application history information in

                                                                                                  Default: (undefined) (i.e. all history information will be kept in memory)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#uimaxapplications","title":"ui.maxApplications

                                                                                                  spark.history.ui.maxApplications

                                                                                                  How many Spark applications HistoryServer should show in the UI

                                                                                                  Default: (unbounded)

                                                                                                  ","text":""},{"location":"history-server/configuration-properties/#uiport","title":"ui.port

                                                                                                  spark.history.ui.port

                                                                                                  The port of History Server's web UI.

                                                                                                  Default: 18080

                                                                                                  ","text":""},{"location":"local/","title":"Spark local","text":"

Spark local is one of the available runtime environments in Apache Spark. It is the only available runtime with no need for a proper cluster manager (and hence many call it a pseudo-cluster; such a concept does exist in Spark, however, and is a bit different).

Spark local is used for the following master URLs (as specified using the SparkConf.setMaster method or the spark.master configuration property; see the sketch after this list):

                                                                                                  • local (with exactly 1 CPU core)

                                                                                                  • local[n] (with exactly n CPU cores)

• local[*] (with as many CPU cores as are available on the local machine)

                                                                                                  • local[n, m] (with exactly n CPU cores and m retries when a task fails)

• local[*, m] (with as many CPU cores as are available on the local machine and m retries when a task fails)
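A minimal sketch of selecting one of these master URLs programmatically (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-demo")  // any name; for illustration only
  .setMaster("local[2,3]")   // 2 threads; 3 is the local-with-retries maxFailures value (see below)
val sc = SparkContext.getOrCreate(conf)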

Internally, Spark local uses LocalSchedulerBackend as the SchedulerBackend and ExecutorBackend.

In this non-distributed multi-threaded runtime environment, Spark spawns all the main execution components - the driver and an executor - in the same single JVM.

The default parallelism is the number of threads as specified in the master URL. This is the only mode where a driver is used for execution (as it acts both as the driver and the only executor).

The local mode is very convenient for testing, debugging or demonstration purposes as it requires no prior setup to launch Spark applications.

This mode of operation is also called Spark in-process or (less commonly) a local version of Spark.

                                                                                                  SparkContext.isLocal returns true when Spark runs in local mode.

                                                                                                  scala> sc.isLocal\nres0: Boolean = true\n

Spark shell defaults to local mode with local[*] as the master URL.

                                                                                                  scala> sc.master\nres0: String = local[*]\n

Tasks are not re-executed on failure in local mode (unless a local-with-retries master URL is used).

The task scheduler in local mode works with the LocalSchedulerBackend task scheduler backend.

                                                                                                  "},{"location":"local/#master-url","title":"Master URL","text":"

                                                                                                  You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total (see the parsing sketch after this list):

                                                                                                  • local uses 1 thread only.

                                                                                                  • local[n] uses n threads.

• local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

NOTE: What happens when there are fewer cores than n in the local[n] master URL? Scheduling \"breaks\" as Spark assumes more CPU cores are available to execute tasks than there actually are.

• [[local-with-retries]] local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of the spark.task.maxFailures configuration property.
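The following toy parser (not Spark's own code, which uses the SparkMasterRegex patterns) mirrors the rules above, mapping a local master URL to a thread count and a maxFailures value; it assumes a single task attempt when no maxFailures is given:

// Toy illustration only.
val LocalN         = """local\[(\*|\d+)\]""".r
val LocalNFailures = """local\[(\*|\d+)\s*,\s*(\d+)\]""".r

def threads(n: String): Int =
  if (n == "*") Runtime.getRuntime.availableProcessors else n.toInt

def threadsAndMaxFailures(master: String): (Int, Int) = master match {
  case "local"              => (1, 1)
  case LocalN(n)            => (threads(n), 1)
  case LocalNFailures(n, m) => (threads(n), m.toInt)
  case other                => sys.error(s"Not a local master URL: $other")
}

// threadsAndMaxFailures("local[4]")    == (4, 1)
// threadsAndMaxFailures("local[*, 2]") == (availableProcessors, 2)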

                                                                                                  == [[task-submission]] Task Submission a.k.a. reviveOffers

Figure: TaskSchedulerImpl.submitTasks in local mode (taskscheduler-submitTasks-local-mode.png)

When ReviveOffers or StatusUpdate messages are received, LocalEndpoint places an offer to TaskSchedulerImpl (using TaskSchedulerImpl.resourceOffers).

If there are one or more tasks that match the offer, they are launched (using the executor.launchTask method).

The number of tasks to be launched is controlled by the number of threads as specified in the master URL. The executor uses threads to spawn the tasks."},{"location":"local/LauncherBackend/","title":"LauncherBackend","text":"

                                                                                                  == [[LauncherBackend]] LauncherBackend

                                                                                                  LauncherBackend is the <> of <> that can <>.

                                                                                                  [[contract]] .LauncherBackend Contract (Abstract Methods Only) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                  | conf a| [[conf]]

                                                                                                  "},{"location":"local/LauncherBackend/#source-scala","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#conf-sparkconf","title":"conf: SparkConf","text":"

SparkConf

Used exclusively when LauncherBackend is requested to <> (to access the spark.launcher.port and spark.launcher.secret configuration properties)

                                                                                                  | onStopRequest a| [[onStopRequest]]

                                                                                                  "},{"location":"local/LauncherBackend/#source-scala_1","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#onstoprequest-unit","title":"onStopRequest(): Unit","text":"

                                                                                                  Handles stop requests (to stop the Spark application as gracefully as possible)

                                                                                                  Used exclusively when LauncherBackend is requested to <>

                                                                                                  |===

                                                                                                  [[creating-instance]] LauncherBackend takes no arguments to be created.

                                                                                                  NOTE: LauncherBackend is a Scala abstract class and cannot be <> directly. It is created indirectly for the <>.

                                                                                                  [[internal-registries]] .LauncherBackend's Internal Properties (e.g. Registries, Counters and Flags) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Name | Description

                                                                                                  | _isConnected a| [[_isConnected]][[isConnected]] Flag that says whether...FIXME (true) or not (false)

                                                                                                  Default: false

                                                                                                  Used when...FIXME

| clientThread a| [[clientThread]] Java's java.lang.Thread

                                                                                                  Used when...FIXME

                                                                                                  | connection a| [[connection]] BackendConnection

                                                                                                  Used when...FIXME

                                                                                                  | lastState a| [[lastState]] SparkAppHandle.State

                                                                                                  Used when...FIXME

                                                                                                  |===

                                                                                                  [[implementations]] LauncherBackend is <> (as an anonymous class) for the following:

                                                                                                  • Spark on YARN's <>

                                                                                                  • Spark local's <>

                                                                                                  • Spark on Mesos' <>

                                                                                                  • Spark Standalone's <>

                                                                                                    === [[close]] Closing -- close Method

                                                                                                    "},{"location":"local/LauncherBackend/#source-scala_2","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#close-unit","title":"close(): Unit","text":"

                                                                                                    close...FIXME

                                                                                                    NOTE: close is used when...FIXME

                                                                                                    === [[connect]] Connecting -- connect Method

                                                                                                    "},{"location":"local/LauncherBackend/#source-scala_3","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#connect-unit","title":"connect(): Unit","text":"

                                                                                                    connect...FIXME

                                                                                                    "},{"location":"local/LauncherBackend/#note","title":"[NOTE]","text":"

                                                                                                    connect is used when:

                                                                                                    • Spark Standalone's StandaloneSchedulerBackend is requested to <> (in client deploy mode)

                                                                                                    • Spark local's LocalSchedulerBackend is <>

                                                                                                    • Spark on Mesos' MesosCoarseGrainedSchedulerBackend is requested to <> (in client deploy mode)"},{"location":"local/LauncherBackend/#spark-on-yarns-client-is-requested-to","title":"* Spark on YARN's Client is requested to <>

                                                                                                      === [[fireStopRequest]] fireStopRequest Internal Method

                                                                                                      ","text":""},{"location":"local/LauncherBackend/#source-scala_4","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#firestoprequest-unit","title":"fireStopRequest(): Unit","text":"

                                                                                                      fireStopRequest...FIXME

                                                                                                      NOTE: fireStopRequest is used exclusively when BackendConnection is requested to handle a Stop message.

                                                                                                      === [[onDisconnected]] Handling Disconnects From Scheduling Backend -- onDisconnected Method

                                                                                                      "},{"location":"local/LauncherBackend/#source-scala_5","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#ondisconnected-unit","title":"onDisconnected(): Unit","text":"

onDisconnected does nothing by default and is expected to be overridden by <>.

                                                                                                      NOTE: onDisconnected is used when...FIXME

                                                                                                      === [[setAppId]] setAppId Method

                                                                                                      "},{"location":"local/LauncherBackend/#source-scala_6","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#setappidappid-string-unit","title":"setAppId(appId: String): Unit","text":"

                                                                                                      setAppId...FIXME

                                                                                                      NOTE: setAppId is used when...FIXME

                                                                                                      === [[setState]] setState Method

                                                                                                      "},{"location":"local/LauncherBackend/#source-scala_7","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#setstatestate-sparkapphandlestate-unit","title":"setState(state: SparkAppHandle.State): Unit","text":"

                                                                                                      setState...FIXME

                                                                                                      NOTE: setState is used when...FIXME

                                                                                                      "},{"location":"local/LocalEndpoint/","title":"LocalEndpoint","text":"

                                                                                                      LocalEndpoint is the ThreadSafeRpcEndpoint for LocalSchedulerBackend and is registered under the LocalSchedulerBackendEndpoint name.

                                                                                                      "},{"location":"local/LocalEndpoint/#review-me","title":"Review Me","text":"

                                                                                                      LocalEndpoint is <> exclusively when LocalSchedulerBackend is requested to <>.

Put simply, LocalEndpoint is the communication channel between <> and <>. LocalEndpoint is a (thread-safe) RpcEndpoint that hosts an <> (with driver ID and localhost hostname) for Spark local mode.

                                                                                                      [[messages]] .LocalEndpoint's RPC Messages [cols=\"1,3\",options=\"header\",width=\"100%\"] |=== | Message | Description

| <> | Requests the <> to kill a given task

                                                                                                      | <> | Calls <> <>

| <> | Requests the <> to stop

                                                                                                      |===

                                                                                                      When a LocalEndpoint starts up (as part of Spark local's initialization) it prints out the following INFO messages to the logs:

                                                                                                      INFO Executor: Starting executor ID driver on host localhost\nINFO Executor: Using REPL class URI: http://192.168.1.4:56131\n

[[executor]] LocalEndpoint creates a single Executor with the following properties (see the sketch after this list):

• [[localExecutorId]] driver for the executor ID

• [[localExecutorHostname]] localhost for the hostname

• <> for the user-defined CLASSPATH

• isLocal flag enabled

• The <> is then used when LocalEndpoint is requested to handle <> and <> RPC messages, and <>.
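A rough sketch of that executor construction (names are the ones used on this page; the real constructor takes more arguments, which are omitted here):

// Sketch only; the actual LocalEndpoint also wires in exception handling and resources.
val executor = new Executor(
  localExecutorId,        // "driver"
  localExecutorHostname,  // "localhost"
  SparkEnv.get,
  userClassPath,
  isLocal = true)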

                                                                                                        [[internal-registries]] .LocalEndpoint's Internal Properties (e.g. Registries, Counters and Flags) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Name | Description

                                                                                                        | freeCores a| [[freeCores]] The number of CPU cores that are free to use (to schedule tasks)

                                                                                                        Default: Initial <> (aka totalCores)

                                                                                                        Increments when LocalEndpoint is requested to handle <> RPC message with a finished state

                                                                                                        Decrements when LocalEndpoint is requested to <> and there were tasks to execute

NOTE: A single task to execute costs spark.task.cpus configuration (default: 1).

                                                                                                        Used when LocalEndpoint is requested to <>

                                                                                                        |===

                                                                                                        [[logging]] [TIP] ==== Enable INFO logging level for org.apache.spark.scheduler.local.LocalEndpoint logger to see what happens inside.

                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                        log4j.logger.org.apache.spark.scheduler.local.LocalEndpoint=INFO\n
                                                                                                        "},{"location":"local/LocalEndpoint/#refer-to-spark-loggingmd-logging","title":"Refer to <<../spark-logging.md#, Logging>>.","text":"

                                                                                                        === [[creating-instance]] Creating LocalEndpoint Instance

                                                                                                        LocalEndpoint takes the following to be created:

• [[rpcEnv]] RpcEnv
• [[userClassPath]] User-defined class path (Seq[URL]) that is the <> configuration property and is used exclusively to create the <>
• [[scheduler]] TaskSchedulerImpl
                                                                                                        • [[executorBackend]] <>
                                                                                                        • [[totalCores]] Number of CPU cores (aka totalCores)
                                                                                                        • LocalEndpoint initializes the <>.

                                                                                                          === [[receive]] Processing Receive-Only RPC Messages -- receive Method

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala","title":"[source, scala]","text":""},{"location":"local/LocalEndpoint/#receive-partialfunctionany-unit","title":"receive: PartialFunction[Any, Unit]","text":"

NOTE: receive is part of the RpcEndpoint abstraction.

                                                                                                          receive handles (processes) <>, <>, and <> RPC messages.

                                                                                                          ==== [[ReviveOffers]] ReviveOffers RPC Message

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_1","title":"[source, scala]","text":""},{"location":"local/LocalEndpoint/#reviveoffers","title":"ReviveOffers()","text":"

                                                                                                          When <>, LocalEndpoint <>.

                                                                                                          NOTE: ReviveOffers RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

                                                                                                          ==== [[StatusUpdate]] StatusUpdate RPC Message

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_2","title":"[source, scala]","text":"

                                                                                                          StatusUpdate( taskId: Long, state: TaskState, serializedData: ByteBuffer)

When <>, LocalEndpoint requests the <> to handle a task status update (given the taskId, the task state and the data).

If the given TaskState is a finished state (one of FINISHED, FAILED, KILLED, LOST), LocalEndpoint adds spark.task.cpus configuration (default: 1) to the <> registry followed by <>.

                                                                                                          NOTE: StatusUpdate RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.
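A condensed sketch of that handling (scheduler, freeCores and reviveOffers are the members described on this page):

// Inside LocalEndpoint's receive handler (simplified)
def handleStatusUpdate(
    taskId: Long,
    state: TaskState.TaskState,
    serializedData: java.nio.ByteBuffer): Unit = {
  scheduler.statusUpdate(taskId, state, serializedData)  // TaskSchedulerImpl.statusUpdate
  if (TaskState.isFinished(state)) {                     // FINISHED, FAILED, KILLED, LOST
    freeCores += scheduler.CPUS_PER_TASK                 // give the task's CPUs back
    reviveOffers()
  }
}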

                                                                                                          ==== [[KillTask]] KillTask RPC Message

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_3","title":"[source, scala]","text":"

                                                                                                          KillTask( taskId: Long, interruptThread: Boolean, reason: String)

When <>, LocalEndpoint requests the single <> to kill a task (given the taskId, the interruptThread flag and the reason).

                                                                                                          NOTE: KillTask RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

                                                                                                          === [[reviveOffers]] Reviving Offers -- reviveOffers Method

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_4","title":"[source, scala]","text":""},{"location":"local/LocalEndpoint/#reviveoffers-unit","title":"reviveOffers(): Unit","text":"

                                                                                                          reviveOffers...FIXME

                                                                                                          NOTE: reviveOffers is used when LocalEndpoint is requested to <> (namely <> and <>).
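A rough sketch of what reviveOffers amounts to, based on the task-submission description earlier (member names as used on this page; resource bookkeeping is omitted):

def reviveOffersSketch(): Unit = {
  // a single offer describing this one local executor and its free CPU cores
  val offers = IndexedSeq(WorkerOffer(localExecutorId, localExecutorHostname, freeCores))
  for (task <- scheduler.resourceOffers(offers).flatten) {
    freeCores -= scheduler.CPUS_PER_TASK
    executor.launchTask(executorBackend, task)
  }
}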

                                                                                                          === [[receiveAndReply]] Processing Receive-Reply RPC Messages -- receiveAndReply Method

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_5","title":"[source, scala]","text":""},{"location":"local/LocalEndpoint/#receiveandreplycontext-rpccallcontext-partialfunctionany-unit","title":"receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]","text":"

NOTE: receiveAndReply is part of the RpcEndpoint abstraction.

                                                                                                          receiveAndReply handles (processes) <> RPC message exclusively.

                                                                                                          ==== [[StopExecutor]] StopExecutor RPC Message

                                                                                                          "},{"location":"local/LocalEndpoint/#source-scala_6","title":"[source, scala]","text":""},{"location":"local/LocalEndpoint/#stopexecutor","title":"StopExecutor()","text":"

When <>, LocalEndpoint requests the single <> to stop and requests the given RpcCallContext to reply with true (as the response).

                                                                                                          NOTE: StopExecutor RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>."},{"location":"local/LocalSchedulerBackend/","title":"LocalSchedulerBackend","text":"

                                                                                                          LocalSchedulerBackend is a SchedulerBackend and an ExecutorBackend for Spark local deployment.

Master URL | Total CPU Cores
local | 1
local[n] | n
local[*] | The number of available CPU cores on the local machine
local[n, m] | n CPU cores and m task retries
local[*, m] | The number of available CPU cores on the local machine and m task retries

                                                                                                          "},{"location":"local/LocalSchedulerBackend/#creating-instance","title":"Creating Instance","text":"

                                                                                                          LocalSchedulerBackend takes the following to be created:

                                                                                                          • SparkConf
                                                                                                          • TaskSchedulerImpl
                                                                                                          • Total number of CPU cores

                                                                                                            LocalSchedulerBackend is created when:

                                                                                                            • SparkContext is requested to create a Spark Scheduler (for local master URL)
                                                                                                            • KubernetesClusterManager (Spark on Kubernetes) is requested for a SchedulerBackend
                                                                                                            "},{"location":"local/LocalSchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks","text":"SchedulerBackend
                                                                                                            maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                            maxNumConcurrentTasks is part of the SchedulerBackend abstraction.

                                                                                                            maxNumConcurrentTasks calculates the number of CPU cores per task for the given ResourceProfile (and this SparkConf).

                                                                                                            In the end, maxNumConcurrentTasks is the total CPU cores available divided by the number of CPU cores per task.
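As plain arithmetic (a sketch; with 8 total CPU cores and 2 CPU cores per task, at most 4 tasks can run concurrently):

def maxConcurrentTasks(totalCores: Int, cpusPerTask: Int): Int =
  totalCores / cpusPerTask

// maxConcurrentTasks(8, 2) == 4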

                                                                                                            "},{"location":"local/LocalSchedulerBackend/#logging","title":"Logging","text":"

                                                                                                            Enable ALL logging level for org.apache.spark.scheduler.local.LocalSchedulerBackend logger to see what happens inside.

                                                                                                            Add the following line to conf/log4j2.properties:

                                                                                                            logger.LocalSchedulerBackend.name = org.apache.spark.scheduler.local.LocalSchedulerBackend\nlogger.LocalSchedulerBackend.level = all\n

                                                                                                            Refer to Logging.

                                                                                                            "},{"location":"memory/","title":"Memory System","text":"

                                                                                                            Memory System is a core component of Apache Spark that is based on UnifiedMemoryManager.

                                                                                                            "},{"location":"memory/#resources","title":"Resources","text":"
                                                                                                            • SPARK-10000: Consolidate storage and execution memory management
                                                                                                            "},{"location":"memory/#videos","title":"Videos","text":"
                                                                                                            • Deep Dive: Apache Spark Memory Management
                                                                                                            • Deep Dive into Project Tungsten
                                                                                                            • Spark Performance: What's Next
                                                                                                            "},{"location":"memory/ExecutionMemoryPool/","title":"ExecutionMemoryPool","text":"

                                                                                                            ExecutionMemoryPool is a MemoryPool.

                                                                                                            "},{"location":"memory/ExecutionMemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                            ExecutionMemoryPool takes the following to be created:

                                                                                                            • Lock Object
                                                                                                            • MemoryMode (ON_HEAP or OFF_HEAP)

                                                                                                              ExecutionMemoryPool is created\u00a0when:

                                                                                                              • MemoryManager is created (and initializes on-heap and off-heap execution memory pools)
                                                                                                              "},{"location":"memory/ExecutionMemoryPool/#acquiring-memory","title":"Acquiring Memory
                                                                                                              acquireMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => (),\n  computeMaxPoolSize: () => Long = () => poolSize): Long\n

                                                                                                              acquireMemory...FIXME

                                                                                                              acquireMemory\u00a0is used when:

                                                                                                              • UnifiedMemoryManager is requested to acquire execution memory
                                                                                                              ","text":""},{"location":"memory/MemoryAllocator/","title":"MemoryAllocator","text":"

                                                                                                              MemoryAllocator is an abstraction of memory allocators that TaskMemoryManager uses to allocate and release memory.

MemoryAllocator defines the two available MemoryAllocators under the names HEAP and UNSAFE.

                                                                                                              A MemoryAllocator to use is selected when MemoryManager is created (based on MemoryMode).
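
As an illustration, here is a minimal sketch of allocating and releasing an on-heap MemoryBlock with the HEAP allocator (assuming access to the org.apache.spark.unsafe.memory package of the spark-unsafe module):

import org.apache.spark.unsafe.memory.MemoryAllocator\n\n// allocate a 1 KB on-heap block\nval block = MemoryAllocator.HEAP.allocate(1024)\n// ... use the block ...\n// and hand it back to the allocator\nMemoryAllocator.HEAP.free(block)\n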

                                                                                                              "},{"location":"memory/MemoryAllocator/#contract","title":"Contract","text":""},{"location":"memory/MemoryAllocator/#allocating-contiguous-block-of-memory","title":"Allocating Contiguous Block of Memory
                                                                                                              MemoryBlock allocate(\n  long size)\n

                                                                                                              Used when:

                                                                                                              • TaskMemoryManager is requested to allocate a memory page
                                                                                                              ","text":""},{"location":"memory/MemoryAllocator/#releasing-memory","title":"Releasing Memory
                                                                                                              void free(\n  MemoryBlock memory)\n

                                                                                                              Used when:

                                                                                                              • TaskMemoryManager is requested to release a memory page and clean up all the allocated memory
                                                                                                              ","text":""},{"location":"memory/MemoryAllocator/#implementations","title":"Implementations","text":"
                                                                                                              • HeapMemoryAllocator
                                                                                                              • UnsafeMemoryAllocator"},{"location":"memory/MemoryConsumer/","title":"MemoryConsumer","text":"

                                                                                                                MemoryConsumer is an abstraction of memory consumers (of TaskMemoryManager) that support spilling.

                                                                                                                MemoryConsumers correspond to individual operators and data structures within a task. TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to consumers in order to trigger spilling when running low on memory.

A MemoryConsumer tracks how much memory it has allocated.

                                                                                                                "},{"location":"memory/MemoryConsumer/#contract","title":"Contract","text":""},{"location":"memory/MemoryConsumer/#spilling","title":"Spilling
                                                                                                                void spill() // (1)\nlong spill(\n  long size,\n  MemoryConsumer trigger)\n
1. Uses Long.MAX_VALUE for the size and this MemoryConsumer as the trigger

                                                                                                                Used when:

                                                                                                                • TaskMemoryManager is requested to acquire execution memory (and trySpillAndAcquire)
                                                                                                                • ShuffleExternalSorter is requested to growPointerArrayIfNecessary, insertRecord
                                                                                                                • UnsafeExternalSorter is requested to createWithExistingInMemorySorter, growPointerArrayIfNecessary, insertRecord, merge
                                                                                                                ","text":""},{"location":"memory/MemoryConsumer/#implementations","title":"Implementations","text":"
                                                                                                                • BytesToBytesMap
                                                                                                                • ShuffleExternalSorter
                                                                                                                • Spillable
                                                                                                                • UnsafeExternalSorter
                                                                                                                • a few others
                                                                                                                "},{"location":"memory/MemoryConsumer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                MemoryConsumer takes the following to be created:

                                                                                                                • TaskMemoryManager
                                                                                                                • Page Size
• MemoryMode (ON_HEAP or OFF_HEAP)

Abstract Class

                                                                                                                  MemoryConsumer\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete MemoryConsumers.
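
For illustration only, a hedged sketch of a custom spillable consumer (the class name and its spill behavior are hypothetical; the constructor parameters and the spill signature follow the contract above):

import org.apache.spark.memory.{MemoryConsumer, MemoryMode, TaskMemoryManager}\n\n// Hypothetical consumer that hands back everything it holds when asked to spill\nclass InMemoryBufferConsumer(tmm: TaskMemoryManager)\n  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), MemoryMode.ON_HEAP) {\n\n  override def spill(size: Long, trigger: MemoryConsumer): Long = {\n    val released = getUsed  // memory currently tracked by this consumer\n    freeMemory(released)    // return it to the TaskMemoryManager\n    released\n  }\n}\n

A real consumer would write its in-memory data out (e.g. to disk) before releasing the memory in spill.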

                                                                                                                  "},{"location":"memory/MemoryManager/","title":"MemoryManager","text":"

                                                                                                                  MemoryManager is an abstraction of memory managers that can share available memory between tasks (TaskMemoryManager) and storage (BlockManager).

                                                                                                                  MemoryManager splits assigned memory into two regions:

                                                                                                                  • Execution Memory for shuffles, joins, sorts and aggregations

                                                                                                                  • Storage Memory for caching and propagating internal data across Spark nodes (in on- and off-heap modes)

                                                                                                                  MemoryManager is used to create BlockManager (and MemoryStore) and TaskMemoryManager.

                                                                                                                  "},{"location":"memory/MemoryManager/#contract","title":"Contract","text":""},{"location":"memory/MemoryManager/#acquiring-execution-memory-for-task","title":"Acquiring Execution Memory for Task
                                                                                                                  acquireExecutionMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  memoryMode: MemoryMode): Long\n

                                                                                                                  Used when:

                                                                                                                  • TaskMemoryManager is requested to acquire execution memory
                                                                                                                  ","text":""},{"location":"memory/MemoryManager/#acquiring-storage-memory-for-block","title":"Acquiring Storage Memory for Block
                                                                                                                  acquireStorageMemory(\n  blockId: BlockId,\n  numBytes: Long,\n  memoryMode: MemoryMode): Boolean\n

                                                                                                                  Used when:

                                                                                                                  • MemoryStore is requested for the putBytes and putIterator
                                                                                                                  ","text":""},{"location":"memory/MemoryManager/#acquiring-unroll-memory-for-block","title":"Acquiring Unroll Memory for Block
                                                                                                                  acquireUnrollMemory(\n  blockId: BlockId,\n  numBytes: Long,\n  memoryMode: MemoryMode): Boolean\n

                                                                                                                  Used when:

                                                                                                                  • MemoryStore is requested for the reserveUnrollMemoryForThisTask
                                                                                                                  ","text":""},{"location":"memory/MemoryManager/#total-available-off-heap-storage-memory","title":"Total Available Off-Heap Storage Memory
                                                                                                                  maxOffHeapStorageMemory: Long\n

                                                                                                                  May vary over time

                                                                                                                  Used when:

                                                                                                                  • BlockManager is created
                                                                                                                  • MemoryStore is requested for the maxMemory
                                                                                                                  ","text":""},{"location":"memory/MemoryManager/#total-available-on-heap-storage-memory","title":"Total Available On-Heap Storage Memory
                                                                                                                  maxOnHeapStorageMemory: Long\n

                                                                                                                  May vary over time

                                                                                                                  Used when:

                                                                                                                  • BlockManager is created
                                                                                                                  • MemoryStore is requested for the maxMemory
                                                                                                                  ","text":""},{"location":"memory/MemoryManager/#implementations","title":"Implementations","text":"
                                                                                                                  • UnifiedMemoryManager
                                                                                                                  "},{"location":"memory/MemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                  MemoryManager takes the following to be created:

                                                                                                                  • SparkConf
                                                                                                                  • Number of CPU Cores
                                                                                                                  • Size of the On-Heap Storage Memory
• Size of the On-Heap Execution Memory

Abstract Class

                                                                                                                    MemoryManager\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete MemoryManagers.

                                                                                                                    "},{"location":"memory/MemoryManager/#SparkEnv","title":"Accessing MemoryManager","text":"

                                                                                                                    MemoryManager is available as SparkEnv.memoryManager on the driver and executors.

                                                                                                                    import org.apache.spark.SparkEnv\nval mm = SparkEnv.get.memoryManager\n
// MemoryManager is private[spark]\n// the following won't work unless within org.apache.spark package\n// import org.apache.spark.memory.MemoryManager\n// assert(mm.isInstanceOf[MemoryManager])\n\n// we have to resort to a string comparison \ud83d\ude14\nassert(\"UnifiedMemoryManager\".equals(mm.getClass.getSimpleName))\n
                                                                                                                    "},{"location":"memory/MemoryManager/#associating-memorystore-with-storage-memory-pools","title":"Associating MemoryStore with Storage Memory Pools
                                                                                                                    setMemoryStore(\n  store: MemoryStore): Unit\n

                                                                                                                    setMemoryStore requests the on-heap and off-heap storage memory pools to use the given MemoryStore.

                                                                                                                    setMemoryStore\u00a0is used when:

                                                                                                                    • BlockManager is created
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#execution-memory-pools","title":"Execution Memory Pools","text":""},{"location":"memory/MemoryManager/#on-heap","title":"On-Heap
                                                                                                                    onHeapExecutionMemoryPool: ExecutionMemoryPool\n

                                                                                                                    MemoryManager creates an ExecutionMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapExecutionMemory.

                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#off-heap","title":"Off-Heap
                                                                                                                    offHeapExecutionMemoryPool: ExecutionMemoryPool\n

                                                                                                                    MemoryManager creates an ExecutionMemoryPool for OFF_HEAP memory mode when created and immediately requests it to incrementPoolSize to...FIXME

                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#storage-memory-pools","title":"Storage Memory Pools","text":""},{"location":"memory/MemoryManager/#on-heap_1","title":"On-Heap
                                                                                                                    onHeapStorageMemoryPool: StorageMemoryPool\n

MemoryManager creates a StorageMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapStorageMemory.

                                                                                                                    onHeapStorageMemoryPool is requested to setMemoryStore when MemoryManager is requested to setMemoryStore.

                                                                                                                    onHeapStorageMemoryPool is requested to release memory when MemoryManager is requested to release on-heap storage memory.

                                                                                                                    onHeapStorageMemoryPool is requested to release all memory when MemoryManager is requested to release all storage memory.

                                                                                                                    onHeapStorageMemoryPool is used when:

                                                                                                                    • MemoryManager is requested for the storageMemoryUsed and onHeapStorageMemoryUsed
                                                                                                                    • UnifiedMemoryManager is requested to acquire on-heap execution and storage memory
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#off-heap_1","title":"Off-Heap
                                                                                                                    offHeapStorageMemoryPool: StorageMemoryPool\n

MemoryManager creates a StorageMemoryPool for OFF_HEAP memory mode when created and immediately requests it to incrementPoolSize to offHeapStorageMemory.

                                                                                                                    MemoryManager requests the MemoryPools to use a given MemoryStore when requested to setMemoryStore.

                                                                                                                    MemoryManager requests the MemoryPools to release memory when requested to releaseStorageMemory.

                                                                                                                    MemoryManager requests the MemoryPools to release all memory when requested to release all storage memory.

                                                                                                                    MemoryManager requests the MemoryPools for the memoryUsed when requested for storageMemoryUsed.

                                                                                                                    offHeapStorageMemoryPool is used when:

                                                                                                                    • MemoryManager is requested for the offHeapStorageMemoryUsed
                                                                                                                    • UnifiedMemoryManager is requested to acquire off-heap execution and storage memory
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#total-storage-memory-used","title":"Total Storage Memory Used
                                                                                                                    storageMemoryUsed: Long\n

storageMemoryUsed is the sum of the memory used by the on-heap and off-heap storage memory pools.

                                                                                                                    storageMemoryUsed\u00a0is used when:

                                                                                                                    • TaskMemoryManager is requested to showMemoryUsage
                                                                                                                    • MemoryStore is requested to memoryUsed
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#memorymode","title":"MemoryMode
                                                                                                                    tungstenMemoryMode: MemoryMode\n

                                                                                                                    tungstenMemoryMode tracks whether Tungsten memory will be allocated on the JVM heap or off-heap (using sun.misc.Unsafe).

                                                                                                                    final val

tungstenMemoryMode is a final value and is initialized only once, when MemoryManager is created.

                                                                                                                    tungstenMemoryMode is OFF_HEAP when the following are all met:

                                                                                                                    • spark.memory.offHeap.enabled configuration property is enabled

                                                                                                                    • spark.memory.offHeap.size configuration property is greater than 0

                                                                                                                    • JVM supports unaligned memory access (aka unaligned Unsafe, i.e. sun.misc.Unsafe package is available and the underlying system has unaligned-access capability)

                                                                                                                    Otherwise, tungstenMemoryMode is ON_HEAP.
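
For example, the following SparkConf sketch would make tungstenMemoryMode resolve to OFF_HEAP (assuming the JVM supports unaligned memory access):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.memory.offHeap.enabled\", \"true\")\n  .set(\"spark.memory.offHeap.size\", (1L << 30).toString) // 1 GB of off-heap memory (in bytes)\n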

                                                                                                                    Note

                                                                                                                    Given that spark.memory.offHeap.enabled configuration property is turned off by default and spark.memory.offHeap.size configuration property is 0 by default, Apache Spark seems to encourage using Tungsten memory allocated on the JVM heap (ON_HEAP).

                                                                                                                    tungstenMemoryMode is used when:

                                                                                                                    • MemoryManager is created (and initializes the pageSizeBytes and tungstenMemoryAllocator internal properties)
                                                                                                                    • TaskMemoryManager is created
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#memoryallocator","title":"MemoryAllocator
                                                                                                                    tungstenMemoryAllocator: MemoryAllocator\n

                                                                                                                    MemoryManager selects the MemoryAllocator to use based on the MemoryMode.

                                                                                                                    final val

tungstenMemoryAllocator is a final value and is initialized only once, when MemoryManager is created.

• ON_HEAP: HeapMemoryAllocator
• OFF_HEAP: UnsafeMemoryAllocator

                                                                                                                    tungstenMemoryAllocator is used when:

                                                                                                                    • TaskMemoryManager is requested to allocate a memory page, release a memory page and clean up all the allocated memory
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#pageSizeBytes","title":"Page Size

                                                                                                                    pageSizeBytes is either spark.buffer.pageSize, if defined, or the default page size.
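
For example, to pin the page size explicitly (a sketch; the 2m value below is arbitrary):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.buffer.pageSize\", \"2m\") // hypothetical 2 MB pages\n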

                                                                                                                    pageSizeBytes is used when:

                                                                                                                    • TaskMemoryManager is requested for the page size
                                                                                                                    ","text":""},{"location":"memory/MemoryManager/#defaultPageSizeBytes","title":"Default Page Size
                                                                                                                    defaultPageSizeBytes: Long\n
                                                                                                                    Lazy Value

                                                                                                                    defaultPageSizeBytes is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

                                                                                                                    Learn more in the Scala Language Specification.
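
A quick demonstration of the once-only evaluation of a Scala lazy value (the computed value below is arbitrary):

var evaluations = 0\nlazy val pageSize: Long = { evaluations += 1; 1L << 20 } // hypothetical 1 MB default\nassert(evaluations == 0)        // not computed yet\nassert(pageSize == (1L << 20))  // first access triggers the computation\nassert(pageSize == (1L << 20))  // later accesses reuse the cached value\nassert(evaluations == 1)\n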

                                                                                                                    ","text":""},{"location":"memory/MemoryPool/","title":"MemoryPool","text":"

                                                                                                                    MemoryPool is an abstraction of memory pools.

                                                                                                                    "},{"location":"memory/MemoryPool/#contract","title":"Contract","text":""},{"location":"memory/MemoryPool/#size-of-memory-used","title":"Size of Memory Used
                                                                                                                    memoryUsed: Long\n

                                                                                                                    Used when:

                                                                                                                    • MemoryPool is requested for the amount of free memory and decrementPoolSize
                                                                                                                    ","text":""},{"location":"memory/MemoryPool/#implementations","title":"Implementations","text":"
                                                                                                                    • ExecutionMemoryPool
                                                                                                                    • StorageMemoryPool
                                                                                                                    "},{"location":"memory/MemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                                    MemoryPool takes the following to be created:

• Lock Object

Abstract Class

                                                                                                                      MemoryPool\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete MemoryPools.

                                                                                                                      "},{"location":"memory/MemoryPool/#free-memory","title":"Free Memory
                                                                                                                      memoryFree\n

memoryFree is the amount of free memory in the pool (the pool size minus the memory used).

                                                                                                                      memoryFree\u00a0is used when:

                                                                                                                      • ExecutionMemoryPool is requested to acquireMemory
                                                                                                                      • StorageMemoryPool is requested to acquireMemory and freeSpaceToShrinkPool
                                                                                                                      • UnifiedMemoryManager is requested to acquire execution and storage memory
                                                                                                                      ","text":""},{"location":"memory/MemoryPool/#decrementpoolsize","title":"decrementPoolSize
                                                                                                                      decrementPoolSize(\n  delta: Long): Unit\n

decrementPoolSize decreases the size of the pool by the given delta (which can be at most the free memory in the pool).

                                                                                                                      decrementPoolSize\u00a0is used when:

                                                                                                                      • UnifiedMemoryManager is requested to acquireExecutionMemory and acquireStorageMemory
                                                                                                                      ","text":""},{"location":"memory/StorageMemoryPool/","title":"StorageMemoryPool","text":"

                                                                                                                      StorageMemoryPool is a MemoryPool.

                                                                                                                      "},{"location":"memory/StorageMemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                                      StorageMemoryPool takes the following to be created:

                                                                                                                      • Lock Object
                                                                                                                      • MemoryMode (ON_HEAP or OFF_HEAP)

                                                                                                                        StorageMemoryPool is created\u00a0when:

                                                                                                                        • MemoryManager is created (and initializes on-heap and off-heap storage memory pools)
                                                                                                                        "},{"location":"memory/StorageMemoryPool/#memorystore","title":"MemoryStore

                                                                                                                        StorageMemoryPool is given a MemoryStore when MemoryManager is requested to associate one with the on- and off-heap storage memory pools.

                                                                                                                        StorageMemoryPool uses the MemoryStore (to evict blocks) when requested to:

                                                                                                                        • Acquire Memory
                                                                                                                        • Free Space to Shrink Pool
                                                                                                                        ","text":""},{"location":"memory/StorageMemoryPool/#size-of-memory-used","title":"Size of Memory Used

                                                                                                                        StorageMemoryPool keeps track of the size of the memory acquired.

The size decreases when StorageMemoryPool is requested to releaseMemory or releaseAllMemory.

                                                                                                                        memoryUsed is part of the MemoryPool abstraction.

                                                                                                                        ","text":""},{"location":"memory/StorageMemoryPool/#acquiring-memory","title":"Acquiring Memory
                                                                                                                        acquireMemory(\n  blockId: BlockId,\n  numBytes: Long): Boolean\nacquireMemory(\n  blockId: BlockId,\n  numBytesToAcquire: Long,\n  numBytesToFree: Long): Boolean\n

                                                                                                                        acquireMemory...FIXME

                                                                                                                        acquireMemory\u00a0is used when:

                                                                                                                        • UnifiedMemoryManager is requested to acquire storage memory
                                                                                                                        ","text":""},{"location":"memory/StorageMemoryPool/#freeing-space-to-shrink-pool","title":"Freeing Space to Shrink Pool
                                                                                                                        freeSpaceToShrinkPool(\n  spaceToFree: Long): Long\n

                                                                                                                        freeSpaceToShrinkPool...FIXME

                                                                                                                        freeSpaceToShrinkPool\u00a0is used when:

                                                                                                                        • UnifiedMemoryManager is requested to acquire execution memory
                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/","title":"TaskMemoryManager","text":"

                                                                                                                        TaskMemoryManager manages the memory allocated to a single task (using MemoryManager).

                                                                                                                        TaskMemoryManager assumes that:

                                                                                                                        1. The number of bits to address pages is 13
                                                                                                                        2. The number of bits to encode offsets in pages is 51 (64 bits - 13 bits)
                                                                                                                        3. Number of pages in the page table and to be allocated is 8192 (1 << 13)
4. The maximum page size is about 16 GiB (((1L << 31) - 1) * 8L bytes)"},{"location":"memory/TaskMemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                          TaskMemoryManager takes the following to be created:

                                                                                                                          • MemoryManager
                                                                                                                          • Task Attempt ID

                                                                                                                            TaskMemoryManager is created\u00a0when:

                                                                                                                            • TaskRunner is requested to run

                                                                                                                            "},{"location":"memory/TaskMemoryManager/#memorymanager","title":"MemoryManager

                                                                                                                            TaskMemoryManager is given a MemoryManager when created.

                                                                                                                            TaskMemoryManager uses the MemoryManager\u00a0when requested for the following:

                                                                                                                            • Acquiring, releasing or cleaning up execution memory
                                                                                                                            • Report memory usage
                                                                                                                            • pageSizeBytes
                                                                                                                            • Allocating a memory block for Tungsten consumers
                                                                                                                            • freePage
                                                                                                                            • getMemoryConsumptionForThisTask
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#page-table-memoryblocks","title":"Page Table (MemoryBlocks)

                                                                                                                            TaskMemoryManager uses an array of MemoryBlocks (to mimic an operating system's page table).

                                                                                                                            The page table uses 13 bits for addressing pages.
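
The following sketch shows how a 13-bit page number and a 51-bit in-page offset can be packed into (and unpacked from) a single 64-bit address, mirroring the encodePageNumberAndOffset and getOffsetInPage logic (the helper names below are illustrative, not the actual TaskMemoryManager methods):

val PAGE_NUMBER_BITS = 13\nval OFFSET_BITS = 64 - PAGE_NUMBER_BITS          // 51\nval LOWER_51_BITS_MASK = (1L << OFFSET_BITS) - 1\n\ndef encode(pageNumber: Int, offsetInPage: Long): Long =\n  (pageNumber.toLong << OFFSET_BITS) | (offsetInPage & LOWER_51_BITS_MASK)\ndef pageNumberOf(address: Long): Int = (address >>> OFFSET_BITS).toInt\ndef offsetOf(address: Long): Long = address & LOWER_51_BITS_MASK\n\nval address = encode(5, 1024L)\nassert(pageNumberOf(address) == 5)\nassert(offsetOf(address) == 1024L)\n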

                                                                                                                            A page is \"stored\" in allocatePage and \"removed\" in freePage.

                                                                                                                            All pages are released (removed) in cleanUpAllAllocatedMemory.

                                                                                                                            TaskMemoryManager uses the page table when requested to:

                                                                                                                            • getPage
                                                                                                                            • getOffsetInPage
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#spillable-memory-consumers","title":"Spillable Memory Consumers
                                                                                                                            HashSet<MemoryConsumer> consumers\n

                                                                                                                            TaskMemoryManager tracks spillable memory consumers.

                                                                                                                            TaskMemoryManager registers a new memory consumer when requested to acquire execution memory.

                                                                                                                            TaskMemoryManager removes (clears) all registered memory consumers when cleaning up all the allocated memory.

                                                                                                                            Memory consumers are used to report memory usage when TaskMemoryManager is requested to show memory usage.

                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#memory-acquired-but-not-used","title":"Memory Acquired But Not Used

TaskMemoryManager tracks the size of memory allocated but not used (by any of the MemoryConsumers, due to an OutOfMemoryError when trying to use it).

                                                                                                                            TaskMemoryManager releases the memory when cleaning up all the allocated memory.

                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#allocated-pages","title":"Allocated Pages
                                                                                                                            BitSet allocatedPages\n

                                                                                                                            TaskMemoryManager uses a BitSet (Java) to track allocated pages.

                                                                                                                            The size is exactly the number of entries in the page table (8192).
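
A sketch of the BitSet bookkeeping (illustrative, not the actual TaskMemoryManager code): page numbers are handed out with nextClearBit and reclaimed with clear.

import java.util.BitSet\n\nval PAGE_TABLE_SIZE = 1 << 13               // 8192 entries\nval allocatedPages = new BitSet(PAGE_TABLE_SIZE)\n\n// \"allocate\": take the lowest free page number and mark it as used\nval pageNumber = allocatedPages.nextClearBit(0)\nallocatedPages.set(pageNumber)\n\n// \"free\": make the page number available again\nallocatedPages.clear(pageNumber)\n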

                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#memorymode","title":"MemoryMode

TaskMemoryManager can be in ON_HEAP or OFF_HEAP mode (to avoid extra work for off-heap memory, hoping that the JIT compiler handles the branching well).

                                                                                                                            TaskMemoryManager is given the MemoryMode matching the MemoryMode (of the given MemoryManager) when created.

TaskMemoryManager uses the MemoryMode for the following:

                                                                                                                            • allocatePage
                                                                                                                            • cleanUpAllAllocatedMemory

In OFF_HEAP mode, TaskMemoryManager has to adjust the offset in encodePageNumberAndOffset and getOffsetInPage.

In OFF_HEAP mode, TaskMemoryManager returns no page (getPage returns null).

                                                                                                                            The MemoryMode is used when:

                                                                                                                            • ShuffleExternalSorter is created
                                                                                                                            • BytesToBytesMap is created
                                                                                                                            • UnsafeExternalSorter is created
                                                                                                                            • Spillable is requested to spill (only when in ON_HEAP mode)
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#acquiring-execution-memory","title":"Acquiring Execution Memory
                                                                                                                            long acquireExecutionMemory(\n  long required,\n  MemoryConsumer consumer)\n

                                                                                                                            acquireExecutionMemory allocates up to required execution memory (bytes) for the MemoryConsumer (from the MemoryManager).

When not enough memory could be allocated initially, acquireExecutionMemory requests every consumer (with the same MemoryMode, including itself) to spill.

                                                                                                                            acquireExecutionMemory returns the amount of memory allocated.

                                                                                                                            acquireExecutionMemory\u00a0is used when:

                                                                                                                            • MemoryConsumer is requested to acquire execution memory
                                                                                                                            • TaskMemoryManager is requested to allocate a page

                                                                                                                            acquireExecutionMemory requests the MemoryManager to acquire execution memory (with required bytes, the taskAttemptId and the MemoryMode of the MemoryConsumer).

                                                                                                                            In the end, acquireExecutionMemory registers the MemoryConsumer (and adds it to the consumers registry) and prints out the following DEBUG message to the logs:

                                                                                                                            Task [taskAttemptId] acquired [got] for [consumer]\n

When the MemoryManager has offered less memory than required, acquireExecutionMemory finds the MemoryConsumers (in the consumers registry) with the same MemoryMode and non-zero memory used, sorts them by memory usage, and requests them (one by one) to spill until enough memory is acquired or there are no more consumers to release memory from (by spilling).

                                                                                                                            When a MemoryConsumer releases memory, acquireExecutionMemory prints out the following DEBUG message to the logs:

                                                                                                                            Task [taskAttemptId] released [released] from [c] for [consumer]\n

                                                                                                                            In case there is still not enough memory (less than required), acquireExecutionMemory requests the MemoryConsumer (to acquire memory for) to spill.

                                                                                                                            acquireExecutionMemory prints out the following DEBUG message to the logs:

                                                                                                                            Task [taskAttemptId] released [released] from itself ([consumer])\n
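
As a rough, self-contained toy model of the allocate / spill-others / spill-self flow above (not Spark code; the real consumer-selection order and bookkeeping are more involved):

// Toy model only: consumers share a single pool of execution memory\nfinal case class Pool(var free: Long)\nfinal class Consumer(val name: String, var used: Long) {\n  def spill(size: Long, pool: Pool): Long = {\n    val freed = math.min(size, used)  // release up to the requested size\n    used -= freed; pool.free += freed\n    freed\n  }\n}\n\ndef acquire(required: Long, me: Consumer, all: Seq[Consumer], pool: Pool): Long = {\n  def take(n: Long): Long = { val got = math.min(n, pool.free); pool.free -= got; got }\n  var got = take(required)\n  // ask other consumers with non-zero usage to spill until enough memory is acquired\n  for (c <- all.filterNot(_ eq me).filter(_.used > 0).sortBy(_.used) if got < required) {\n    c.spill(required - got, pool)\n    got += take(required - got)\n  }\n  // last resort: the requesting consumer spills its own data\n  if (got < required) { me.spill(required - got, pool); got += take(required - got) }\n  me.used += got\n  got\n}\n

For example, with 30 bytes free in the pool and two other consumers holding memory, acquire(100, ...) first takes the 30 free bytes and then spills the other consumers (smallest first here) until 100 bytes are acquired or nothing is left to spill.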
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#releasing-execution-memory","title":"Releasing Execution Memory
                                                                                                                            void releaseExecutionMemory(\n  long size,\n  MemoryConsumer consumer)\n

                                                                                                                            releaseExecutionMemory prints out the following DEBUG message to the logs:

                                                                                                                            Task [taskAttemptId] release [size] from [consumer]\n

                                                                                                                            In the end, releaseExecutionMemory requests the MemoryManager to releaseExecutionMemory.

                                                                                                                            releaseExecutionMemory is used when:

                                                                                                                            • MemoryConsumer is requested to free up memory
                                                                                                                            • TaskMemoryManager is requested to allocatePage and freePage
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#pageSizeBytes","title":"Page Size
                                                                                                                            long pageSizeBytes()\n

                                                                                                                            pageSizeBytes requests the MemoryManager for the page size.

                                                                                                                            pageSizeBytes is used when:

                                                                                                                            • MemoryConsumer is created
                                                                                                                            • ShuffleExternalSorter is created (as a MemoryConsumer)
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#reporting-memory-usage","title":"Reporting Memory Usage
                                                                                                                            void showMemoryUsage()\n

                                                                                                                            showMemoryUsage prints out the following INFO message to the logs (with the taskAttemptId):

                                                                                                                            Memory used in task [taskAttemptId]\n

                                                                                                                            showMemoryUsage requests every MemoryConsumer to report memory used. For consumers with non-zero memory usage, showMemoryUsage prints out the following INFO message to the logs:

                                                                                                                            Acquired by [consumer]: [memUsage]\n

showMemoryUsage requests the MemoryManager to getExecutionMemoryUsageForTask and calculates the memory not accounted for (i.e. memory used but not associated with any specific consumer).

showMemoryUsage prints out the following INFO message to the logs:

                                                                                                                            [memoryNotAccountedFor] bytes of memory were used by task [taskAttemptId] but are not associated with specific consumers\n

                                                                                                                            showMemoryUsage requests the MemoryManager for the executionMemoryUsed and storageMemoryUsed and prints out the following INFO message to the logs:

                                                                                                                            [executionMemoryUsed] bytes of memory are used for execution and\n[storageMemoryUsed] bytes of memory are used for storage\n

                                                                                                                            showMemoryUsage is used when:

                                                                                                                            • MemoryConsumer is requested to throw an OutOfMemoryError
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#cleaning-up-all-allocated-memory","title":"Cleaning Up All Allocated Memory
                                                                                                                            long cleanUpAllAllocatedMemory()\n

cleanUpAllAllocatedMemory finds all the registered MemoryConsumers (in the consumers registry) that still have some memory in use and, for every such consumer, prints out the following DEBUG message to the logs:

                                                                                                                            unreleased [getUsed] memory from [consumer]\n

cleanUpAllAllocatedMemory then removes all the consumers (clearing the consumers collection).

                                                                                                                            For every MemoryBlock in the pageTable, cleanUpAllAllocatedMemory prints out the following DEBUG message to the logs:

                                                                                                                            unreleased page: [page] in task [taskAttemptId]\n

                                                                                                                            cleanUpAllAllocatedMemory marks the pages to be freed (FREED_IN_TMM_PAGE_NUMBER) and requests the MemoryManager for the tungstenMemoryAllocator to free up the MemoryBlock.

                                                                                                                            cleanUpAllAllocatedMemory clears the pageTable registry (by assigning null values).

                                                                                                                            cleanUpAllAllocatedMemory requests the MemoryManager to release execution memory that is not used by any consumer (with the acquiredButNotUsed and the tungstenMemoryMode).

In the end, cleanUpAllAllocatedMemory requests the MemoryManager to release all execution memory for the task (and returns the amount of memory released).

                                                                                                                            cleanUpAllAllocatedMemory\u00a0is used when:

                                                                                                                            • TaskRunner is requested to run a task (and the task has finished successfully)
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#allocating-memory-page","title":"Allocating Memory Page
                                                                                                                            MemoryBlock allocatePage(\n  long size,\n  MemoryConsumer consumer)\n

                                                                                                                            allocatePage allocates a block of memory (page) that is:

                                                                                                                            1. Below MAXIMUM_PAGE_SIZE_BYTES maximum size
                                                                                                                            2. For MemoryConsumers with the same MemoryMode as the TaskMemoryManager

allocatePage acquires execution memory (acquireExecutionMemory) for the given size and MemoryConsumer. allocatePage returns immediately (with null) when the allocation ended up with 0 bytes or less.

allocatePage reserves the first clear bit (the page number) in the allocatedPages bitmap (unless the whole page table is already taken, in which case allocatePage throws an IllegalStateException).

                                                                                                                            allocatePage requests the MemoryManager for the tungstenMemoryAllocator that is requested to allocate the acquired memory.

                                                                                                                            allocatePage registers the page in the pageTable.

                                                                                                                            In the end, allocatePage prints out the following TRACE message to the logs and returns the MemoryBlock allocated.

                                                                                                                            Allocate page number [pageNumber] ([acquired] bytes)\n
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#usage","title":"Usage

                                                                                                                            allocatePage is used when:

                                                                                                                            • MemoryConsumer is requested to allocate an array and a page
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#toolargepageexception","title":"TooLargePageException

For sizes larger than the MAXIMUM_PAGE_SIZE_BYTES, allocatePage throws a TooLargePageException.

                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#outofmemoryerror","title":"OutOfMemoryError

                                                                                                                            Requesting the tungstenMemoryAllocator to allocate the acquired memory may throw an OutOfMemoryError. If so, allocatePage prints out the following WARN message to the logs:

                                                                                                                            Failed to allocate a page ([acquired] bytes), try again.\n

                                                                                                                            allocatePage adds the acquired memory to the acquiredButNotUsed and removes the page from the allocatedPages (by clearing the bit).

                                                                                                                            In the end, allocatePage tries to allocate the page again (recursively).
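
A hedged sketch of this record-and-retry behavior (illustrative names and types, not Spark's code):

// Illustrative sketch: on allocator OOM, record the acquired-but-unbacked memory and retry.\nvar acquiredButNotUsed = 0L\n\ndef allocateOrRetry(acquired: Long, allocate: Long => Array[Long]): Array[Long] =\n  try allocate(acquired)\n  catch {\n    case _: OutOfMemoryError =>\n      acquiredButNotUsed += acquired  // remembered so the memory can be released later\n      allocateOrRetry(acquired, allocate)\n  }\n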

                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#releasing-memory-page","title":"Releasing Memory Page
                                                                                                                            void freePage(\n  MemoryBlock page,\n  MemoryConsumer consumer)\n

freePage removes the given MemoryBlock (page) from the pageTable, clears its bit in the allocatedPages registry and marks the page as freed (FREED_IN_TMM_PAGE_NUMBER). freePage then requests the MemoryManager for the tungstenMemoryAllocator to free up the page and, in the end, releases the execution memory (of the page size) for the MemoryConsumer.

freePage is used when:

• MemoryConsumer is requested to freePage and throwOom
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#getting-page","title":"Getting Page
                                                                                                                            Object getPage(\n  long pagePlusOffsetAddress)\n

getPage handles the ON_HEAP tungstenMemoryMode only (for the OFF_HEAP mode there is no base object and getPage gives null).

getPage looks up the page (by the page number encoded in the given address) in the page table and requests it for the base object.

                                                                                                                            getPage is used when:

                                                                                                                            • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                            • Location (of BytesToBytesMap) is requested to updateAddressesAndSizes
                                                                                                                            • SortComparator (of UnsafeInMemorySorter) is requested to compare two record pointers
                                                                                                                            • SortedIterator (of UnsafeInMemorySorter) is requested to loadNext record
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#getoffsetinpage","title":"getOffsetInPage
                                                                                                                            long getOffsetInPage(\n  long pagePlusOffsetAddress)\n

getOffsetInPage gives the offset associated with the given pagePlusOffsetAddress (encoded by encodePageNumberAndOffset). A sketch of the encoding follows the list below.

                                                                                                                            getOffsetInPage is used when:

                                                                                                                            • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                            • Location (of BytesToBytesMap) is requested to updateAddressesAndSizes
                                                                                                                            • SortComparator (of UnsafeInMemorySorter) is requested to compare two record pointers
                                                                                                                            • SortedIterator (of UnsafeInMemorySorter) is requested to loadNext record
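
A minimal sketch of the page-plus-offset address encoding behind getPage and getOffsetInPage, assuming the 13-bit page number / 51-bit in-page offset split that TaskMemoryManager uses (the helper names mirror the methods above):

// Encode a page number (upper 13 bits) and an in-page offset (lower 51 bits) into one long.\nval OffsetBits = 51\nval OffsetMask = (1L << OffsetBits) - 1\n\ndef encodePageNumberAndOffset(pageNumber: Int, offsetInPage: Long): Long =\n  (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)\n\ndef decodePageNumber(address: Long): Int = (address >>> OffsetBits).toInt\ndef getOffsetInPage(address: Long): Long = address & OffsetMask\n\nval addr = encodePageNumberAndOffset(3, 128)\nassert(decodePageNumber(addr) == 3 && getOffsetInPage(addr) == 128)\n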
                                                                                                                            ","text":""},{"location":"memory/TaskMemoryManager/#logging","title":"Logging

                                                                                                                            Enable ALL logging level for org.apache.spark.memory.TaskMemoryManager logger to see what happens inside.

                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                            log4j.logger.org.apache.spark.memory.TaskMemoryManager=ALL\n

                                                                                                                            Refer to Logging.
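
If your Spark distribution ships with Log4j 2 (conf/log4j2.properties), a hedged equivalent is:

logger.TaskMemoryManager.name = org.apache.spark.memory.TaskMemoryManager\nlogger.TaskMemoryManager.level = all\n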

                                                                                                                            ","text":""},{"location":"memory/UnifiedMemoryManager/","title":"UnifiedMemoryManager","text":"

                                                                                                                            UnifiedMemoryManager is a MemoryManager (with the onHeapExecutionMemory being the Maximum Heap Memory with the onHeapStorageRegionSize taken out).

                                                                                                                            UnifiedMemoryManager allows for soft boundaries between storage and execution memory (allowing requests for memory in one region to be fulfilled by borrowing memory from the other).

                                                                                                                            "},{"location":"memory/UnifiedMemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                            UnifiedMemoryManager takes the following to be created:

                                                                                                                            • SparkConf
                                                                                                                            • Maximum Heap Memory
                                                                                                                            • Size of the On-Heap Storage Region
                                                                                                                            • Number of CPU Cores

                                                                                                                              While being created, UnifiedMemoryManager asserts the invariants.

                                                                                                                              UnifiedMemoryManager is created\u00a0using apply factory.

                                                                                                                              "},{"location":"memory/UnifiedMemoryManager/#invariants","title":"Invariants

                                                                                                                              UnifiedMemoryManager asserts the following:

                                                                                                                              • Sum of the pool size of the on-heap ExecutionMemoryPool and on-heap StorageMemoryPool is exactly the maximum heap memory

                                                                                                                              • Sum of the pool size of the off-heap ExecutionMemoryPool and off-heap StorageMemoryPool is exactly the maximum off-heap memory

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#total-available-on-heap-memory-for-storage","title":"Total Available On-Heap Memory for Storage
                                                                                                                              maxOnHeapStorageMemory: Long\n

                                                                                                                              maxOnHeapStorageMemory\u00a0is part of the MemoryManager abstraction.

                                                                                                                              maxOnHeapStorageMemory is the difference between Maximum Heap Memory and the memory used in the on-heap execution memory pool.
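
In other words (a sketch of the formula with illustrative parameter names):

def maxOnHeapStorageMemory(maxHeapMemory: Long, onHeapExecutionMemoryUsed: Long): Long =\n  maxHeapMemory - onHeapExecutionMemoryUsed\n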

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#size-of-the-on-heap-storage-memory","title":"Size of the On-Heap Storage Memory

                                                                                                                              UnifiedMemoryManager is given the size of the on-heap storage memory (region) when created.

                                                                                                                              The size is the fraction (based on spark.memory.storageFraction configuration property) of the maximum heap memory.

                                                                                                                              The remaining memory space (of the maximum heap memory) is used for the on-heap execution memory.
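
As a worked example (assuming the default 0.5 for spark.memory.storageFraction and the maxMemory value from the Demo below):

val maxHeapMemory = 956615884L                  // maxMemory from the Demo below\nval storageFraction = 0.5                       // spark.memory.storageFraction (default: 0.5)\nval onHeapStorageRegionSize = (maxHeapMemory * storageFraction).toLong\n// onHeapStorageRegionSize: Long = 478307942 (~456 MB); the rest is the on-heap execution region\n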

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#creating-unifiedmemorymanager","title":"Creating UnifiedMemoryManager
                                                                                                                              apply(\n  conf: SparkConf,\n  numCores: Int): UnifiedMemoryManager\n

apply creates a UnifiedMemoryManager with the Maximum Heap Memory and the size of the on-heap storage region as the spark.memory.storageFraction fraction of the maximum heap memory.

                                                                                                                              apply\u00a0is used when:

                                                                                                                              • SparkEnv utility is used to create a base SparkEnv (for the driver and executors)
                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#maximum-heap-memory","title":"Maximum Heap Memory

UnifiedMemoryManager is given the maximum heap memory to use (for execution and storage) when created (through the apply factory method that uses getMaxMemory).

UnifiedMemoryManager makes sure that the driver's system memory is at least 1.5 times the Reserved System Memory. Otherwise, getMaxMemory throws an IllegalArgumentException:

                                                                                                                              System memory [systemMemory] must be at least [minSystemMemory].\nPlease increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.\n

                                                                                                                              UnifiedMemoryManager makes sure that the executor memory (spark.executor.memory) is at least the Reserved System Memory. Otherwise, getMaxMemory throws an IllegalArgumentException:

                                                                                                                              Executor memory [executorMemory] must be at least [minSystemMemory].\nPlease increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.\n

                                                                                                                              UnifiedMemoryManager considers \"usable\" memory to be the system memory without the reserved memory.

                                                                                                                              UnifiedMemoryManager uses the fraction (based on spark.memory.fraction configuration property) of the \"usable\" memory for the maximum heap memory.

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#demo","title":"Demo
                                                                                                                              // local mode with --conf spark.driver.memory=2g\nscala> sc.getConf.getSizeAsBytes(\"spark.driver.memory\")\nres0: Long = 2147483648\n\nscala> val systemMemory = Runtime.getRuntime.maxMemory\n\n// fixed amount of memory for non-storage, non-execution purposes\n// UnifiedMemoryManager.RESERVED_SYSTEM_MEMORY_BYTES\nval reservedMemory = 300 * 1024 * 1024\n\n// minimum system memory required\nval minSystemMemory = (reservedMemory * 1.5).ceil.toLong\n\nval usableMemory = systemMemory - reservedMemory\n\nval memoryFraction = sc.getConf.getDouble(\"spark.memory.fraction\", 0.6)\nscala> val maxMemory = (usableMemory * memoryFraction).toLong\nmaxMemory: Long = 956615884\n\nimport org.apache.spark.network.util.JavaUtils\nscala> JavaUtils.byteStringAsMb(maxMemory + \"b\")\nres1: Long = 912\n
                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#reserved-system-memory","title":"Reserved System Memory

                                                                                                                              UnifiedMemoryManager considers 300MB (300 * 1024 * 1024 bytes) as a reserved system memory while calculating the maximum heap memory.

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#acquiring-execution-memory-for-task","title":"Acquiring Execution Memory for Task
                                                                                                                              acquireExecutionMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  memoryMode: MemoryMode): Long\n

                                                                                                                              acquireExecutionMemory asserts the invariants.

                                                                                                                              acquireExecutionMemory selects the execution and storage pools, the storage region size and the maximum memory for the given MemoryMode.

MemoryMode           ON_HEAP                      OFF_HEAP
executionPool        onHeapExecutionMemoryPool    offHeapExecutionMemoryPool
storagePool          onHeapStorageMemoryPool      offHeapStorageMemoryPool
storageRegionSize    onHeapStorageRegionSize      offHeapStorageMemory
maxMemory            maxHeapMemory                maxOffHeapMemory

                                                                                                                              In the end, acquireExecutionMemory requests the ExecutionMemoryPool to acquire memory of numBytes bytes (with the maybeGrowExecutionPool and the maximum size of execution pool functions).

                                                                                                                              acquireExecutionMemory\u00a0is part of the MemoryManager abstraction.

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#maybegrowexecutionpool","title":"maybeGrowExecutionPool
                                                                                                                              maybeGrowExecutionPool(\n  extraMemoryNeeded: Long): Unit\n

                                                                                                                              maybeGrowExecutionPool...FIXME

                                                                                                                              ","text":""},{"location":"memory/UnifiedMemoryManager/#maximum-size-of-execution-pool","title":"Maximum Size of Execution Pool
                                                                                                                              computeMaxExecutionPoolSize(): Long\n

computeMaxExecutionPoolSize takes the minimum of the following two values (based on the memory mode, ON_HEAP or OFF_HEAP, respectively):

• Memory used of the on-heap or the off-heap storage memory pool
• The on-heap or the off-heap storage region size

                                                                                                                              In the end, computeMaxExecutionPoolSize returns the size of the remaining memory space of the maximum memory (the maxHeapMemory or the maxOffHeapMemory for ON_HEAP or OFF_HEAP memory mode, respectively) without (the minimum size of) the storage memory region.
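
Put differently (a sketch of the formula with illustrative parameter names):

def computeMaxExecutionPoolSize(maxMemory: Long, storagePoolMemoryUsed: Long, storageRegionSize: Long): Long =\n  maxMemory - math.min(storagePoolMemoryUsed, storageRegionSize)\n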

                                                                                                                              ","text":""},{"location":"memory/UnsafeExternalSorter/","title":"UnsafeExternalSorter","text":"

                                                                                                                              UnsafeExternalSorter is a MemoryConsumer.

                                                                                                                              "},{"location":"memory/UnsafeExternalSorter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                              UnsafeExternalSorter takes the following to be created:

                                                                                                                              • TaskMemoryManager
                                                                                                                              • BlockManager
                                                                                                                              • SerializerManager
                                                                                                                              • TaskContext
                                                                                                                              • RecordComparator Supplier
                                                                                                                              • PrefixComparator
                                                                                                                              • Initial Size
                                                                                                                              • Page size (in bytes)
                                                                                                                              • numElementsForSpillThreshold
                                                                                                                              • UnsafeInMemorySorter
                                                                                                                              • canUseRadixSort flag

                                                                                                                                UnsafeExternalSorter is created\u00a0when:

                                                                                                                                • UnsafeExternalSorter utility is used to createWithExistingInMemorySorter and create
                                                                                                                                "},{"location":"memory/UnsafeExternalSorter/#createwithexistinginmemorysorter","title":"createWithExistingInMemorySorter
                                                                                                                                UnsafeExternalSorter createWithExistingInMemorySorter(\n  TaskMemoryManager taskMemoryManager,\n  BlockManager blockManager,\n  SerializerManager serializerManager,\n  TaskContext taskContext,\n  Supplier<RecordComparator> recordComparatorSupplier,\n  PrefixComparator prefixComparator,\n  int initialSize,\n  long pageSizeBytes,\n  int numElementsForSpillThreshold,\n  UnsafeInMemorySorter inMemorySorter,\n  long existingMemoryConsumption)\n

                                                                                                                                createWithExistingInMemorySorter...FIXME

                                                                                                                                createWithExistingInMemorySorter\u00a0is used when:

                                                                                                                                • UnsafeKVExternalSorter is created
                                                                                                                                ","text":""},{"location":"memory/UnsafeExternalSorter/#create","title":"create
                                                                                                                                UnsafeExternalSorter create(\n  TaskMemoryManager taskMemoryManager,\n  BlockManager blockManager,\n  SerializerManager serializerManager,\n  TaskContext taskContext,\n  Supplier<RecordComparator> recordComparatorSupplier,\n  PrefixComparator prefixComparator,\n  int initialSize,\n  long pageSizeBytes,\n  int numElementsForSpillThreshold,\n  boolean canUseRadixSort)\n

                                                                                                                                create creates a new UnsafeExternalSorter with no UnsafeInMemorySorter given (null).

                                                                                                                                create\u00a0is used when:

                                                                                                                                • UnsafeExternalRowSorter and UnsafeKVExternalSorter are created
                                                                                                                                ","text":""},{"location":"memory/UnsafeInMemorySorter/","title":"UnsafeInMemorySorter","text":""},{"location":"memory/UnsafeInMemorySorter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                UnsafeInMemorySorter takes the following to be created:

                                                                                                                                • MemoryConsumer
                                                                                                                                • TaskMemoryManager
                                                                                                                                • RecordComparator
                                                                                                                                • PrefixComparator
                                                                                                                                • Long Array or Size
                                                                                                                                • canUseRadixSort flag

                                                                                                                                  UnsafeInMemorySorter is created\u00a0when:

                                                                                                                                  • UnsafeExternalSorter is created
                                                                                                                                  • UnsafeKVExternalSorter is created
                                                                                                                                  "},{"location":"memory/UnsafeSorterSpillReader/","title":"UnsafeSorterSpillReader","text":"


                                                                                                                                  UnsafeSorterSpillReader is...FIXME

                                                                                                                                  "},{"location":"memory/UnsafeSorterSpillWriter/","title":"UnsafeSorterSpillWriter","text":"


                                                                                                                                  UnsafeSorterSpillWriter is...FIXME

                                                                                                                                  "},{"location":"metrics/","title":"Spark Metrics","text":"

                                                                                                                                  Spark Metrics gives you execution metrics of Spark subsystems (metrics instances, e.g. the driver of a Spark application or the master of a Spark Standalone cluster).

                                                                                                                                  Spark Metrics uses Dropwizard Metrics Java library for the metrics infrastructure.

                                                                                                                                  Metrics is a Java library which gives you unparalleled insight into what your code does in production.

                                                                                                                                  Metrics provides a powerful toolkit of ways to measure the behavior of critical components in your production environment.

                                                                                                                                  "},{"location":"metrics/#metrics-systems","title":"Metrics Systems","text":""},{"location":"metrics/#applicationmaster","title":"applicationMaster","text":"

                                                                                                                                  Registered when ApplicationMaster (Hadoop YARN) is requested to createAllocator

                                                                                                                                  "},{"location":"metrics/#applications","title":"applications","text":"

                                                                                                                                  Registered when Master (Spark Standalone) is created

                                                                                                                                  "},{"location":"metrics/#driver","title":"driver","text":"

                                                                                                                                  Registered when SparkEnv is created for the driver

                                                                                                                                  "},{"location":"metrics/#executor","title":"executor","text":"

                                                                                                                                  Registered when SparkEnv is created for an executor

                                                                                                                                  "},{"location":"metrics/#master","title":"master","text":"

                                                                                                                                  Registered when Master (Spark Standalone) is created

                                                                                                                                  "},{"location":"metrics/#mesos_cluster","title":"mesos_cluster","text":"

                                                                                                                                  Registered when MesosClusterScheduler (Apache Mesos) is created

                                                                                                                                  "},{"location":"metrics/#shuffleservice","title":"shuffleService","text":"

                                                                                                                                  Registered when ExternalShuffleService is created

                                                                                                                                  "},{"location":"metrics/#worker","title":"worker","text":"

                                                                                                                                  Registered when Worker (Spark Standalone) is created

                                                                                                                                  "},{"location":"metrics/#metricssystem","title":"MetricsSystem

                                                                                                                                  Spark Metrics uses MetricsSystem.

                                                                                                                                  MetricsSystem uses Dropwizard Metrics' MetricRegistry that acts as the integration point between Spark and the metrics library.

                                                                                                                                  A Spark subsystem can access the MetricsSystem through the SparkEnv.metricsSystem property.

                                                                                                                                  val metricsSystem = SparkEnv.get.metricsSystem\n
                                                                                                                                  ","text":""},{"location":"metrics/#metricsconfig","title":"MetricsConfig

MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

metrics.properties is the default metrics configuration file. It is configured using the spark.metrics.conf configuration property. The file is first loaded from the path directly before using Spark's CLASSPATH.

                                                                                                                                  MetricsConfig also accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.
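
For example, a sink can be configured with a spark.metrics.conf.-prefixed Spark property instead of a metrics configuration file (a hedged sketch using the JmxSink described below):

import org.apache.spark.SparkConf\n\n// Equivalent to the *.sink.jmx.class line in metrics.properties (see JmxSink Metrics Sink below)\nval conf = new SparkConf()\n  .set(\"spark.metrics.conf.*.sink.jmx.class\", \"org.apache.spark.metrics.sink.JmxSink\")\n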

                                                                                                                                  Spark comes with conf/metrics.properties.template file that is a template of metrics configuration.

                                                                                                                                  ","text":""},{"location":"metrics/#metricsservlet-metrics-sink","title":"MetricsServlet Metrics Sink

Among the metrics sinks is MetricsServlet that is used when the sink.servlet metrics sink is configured in the metrics configuration.

                                                                                                                                  CAUTION: FIXME Describe configuration files and properties

                                                                                                                                  ","text":""},{"location":"metrics/#jmxsink-metrics-sink","title":"JmxSink Metrics Sink

Enable org.apache.spark.metrics.sink.JmxSink in the metrics configuration.

                                                                                                                                  You can then use jconsole to access Spark metrics through JMX.

                                                                                                                                  *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink\n

                                                                                                                                  ","text":""},{"location":"metrics/#json-uri-path","title":"JSON URI Path

                                                                                                                                  Metrics System is available at http://localhost:4040/metrics/json (for the default setup of a Spark application).

                                                                                                                                  $ http --follow http://localhost:4040/metrics/json\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 2200\nContent-Type: text/json;charset=utf-8\nDate: Sat, 25 Feb 2017 14:14:16 GMT\nServer: Jetty(9.2.z-SNAPSHOT)\nX-Frame-Options: SAMEORIGIN\n\n{\n    \"counters\": {\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.fileCacheHits\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.filesDiscovered\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.hiveClientCalls\": {\n            \"count\": 2\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.parallelListingJobCount\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.partitionsFetched\": {\n            \"count\": 0\n        }\n    },\n    \"gauges\": {\n    ...\n    \"timers\": {\n        \"app-20170225151406-0000.driver.DAGScheduler.messageProcessingTime\": {\n            \"count\": 0,\n            \"duration_units\": \"milliseconds\",\n            \"m15_rate\": 0.0,\n            \"m1_rate\": 0.0,\n            \"m5_rate\": 0.0,\n            \"max\": 0.0,\n            \"mean\": 0.0,\n            \"mean_rate\": 0.0,\n            \"min\": 0.0,\n            \"p50\": 0.0,\n            \"p75\": 0.0,\n            \"p95\": 0.0,\n            \"p98\": 0.0,\n            \"p99\": 0.0,\n            \"p999\": 0.0,\n            \"rate_units\": \"calls/second\",\n            \"stddev\": 0.0\n        }\n    },\n    \"version\": \"3.0.0\"\n}\n

                                                                                                                                  NOTE: You can access a Spark subsystem's MetricsSystem using its corresponding \"leading\" port, e.g. 4040 for the driver, 8080 for Spark Standalone's master and applications.

                                                                                                                                  NOTE: You have to use the trailing slash (/) to have the output.

                                                                                                                                  ","text":""},{"location":"metrics/#spark-standalone-master","title":"Spark Standalone Master
                                                                                                                                  $ http http://192.168.1.4:8080/metrics/master/json/path\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 207\nContent-Type: text/json;charset=UTF-8\nServer: Jetty(8.y.z-SNAPSHOT)\nX-Frame-Options: SAMEORIGIN\n\n{\n    \"counters\": {},\n    \"gauges\": {\n        \"master.aliveWorkers\": {\n            \"value\": 0\n        },\n        \"master.apps\": {\n            \"value\": 0\n        },\n        \"master.waitingApps\": {\n            \"value\": 0\n        },\n        \"master.workers\": {\n            \"value\": 0\n        }\n    },\n    \"histograms\": {},\n    \"meters\": {},\n    \"timers\": {},\n    \"version\": \"3.0.0\"\n}\n
                                                                                                                                  ","text":""},{"location":"metrics/JvmSource/","title":"JvmSource","text":"

                                                                                                                                  JvmSource is a metrics source.

                                                                                                                                  The name of the source is jvm.

JvmSource registers the built-in Codahale metrics:

                                                                                                                                  • GarbageCollectorMetricSet
                                                                                                                                  • MemoryUsageGaugeSet
                                                                                                                                  • BufferPoolMetricSet

                                                                                                                                  Among the metrics is total.committed (from MemoryUsageGaugeSet) that describes the current usage of the heap and non-heap memories.
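
JvmSource can be enabled per metrics instance in the metrics configuration, e.g. for all instances in conf/metrics.properties (a hedged example):

*.source.jvm.class=org.apache.spark.metrics.source.JvmSource\n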

                                                                                                                                  "},{"location":"metrics/MetricsConfig/","title":"MetricsConfig","text":"

                                                                                                                                  MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

MetricsConfig is created when MetricsSystem is.

MetricsConfig uses metrics.properties as the default metrics configuration file that can be configured using the spark.metrics.conf configuration property. The file is first loaded from the path directly before using Spark's CLASSPATH.

                                                                                                                                  MetricsConfig accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.

                                                                                                                                  Spark comes with conf/metrics.properties.template file that is a template of metrics configuration.

MetricsConfig makes sure that the following default properties are always defined:

*.sink.servlet.class              org.apache.spark.metrics.sink.MetricsServlet
*.sink.servlet.path               /metrics/json
master.sink.servlet.path          /metrics/master/json
applications.sink.servlet.path    /metrics/applications/json

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#note","title":"[NOTE]","text":"

                                                                                                                                  The order of precedence of metrics configuration settings is as follows:

1. Default properties
2. spark.metrics.conf configuration property or the metrics.properties configuration file
3. spark.metrics.conf.-prefixed Spark properties

MetricsConfig takes a SparkConf when created.

MetricsConfig's internal registries and counters:

• properties: java.util.Properties with metrics properties. Used to initialize the per-subsystem perInstanceSubProperties.
• perInstanceSubProperties: Lookup table of metrics properties per subsystem.

Initializing MetricsConfig -- initialize Method

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#initialize-unit","title":"initialize(): Unit","text":"

initialize sets the default properties and loads the properties from the metrics configuration file (that is defined using the spark.metrics.conf configuration property).

initialize takes all Spark properties that start with the spark.metrics.conf. prefix (from the SparkConf) and adds them to the properties registry (without the prefix).

In the end, initialize splits the properties per subsystem, with the default configuration (denoted as *) added to every subsystem afterwards.

                                                                                                                                  NOTE: initialize accepts * (star) for the default configuration or any combination of lower- and upper-case letters for Spark subsystem names.

                                                                                                                                  NOTE: initialize is used exclusively when MetricsSystem is created.

setDefaultProperties Internal Method

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala_1","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#setdefaultpropertiesprop-properties-unit","title":"setDefaultProperties(prop: Properties): Unit","text":"

setDefaultProperties sets the default properties (in the input prop).

NOTE: setDefaultProperties is used exclusively when MetricsConfig is initialized.

                                                                                                                                  === [[loadPropertiesFromFile]] Loading Custom Metrics Configuration File or metrics.properties -- loadPropertiesFromFile Method

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala_2","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#loadpropertiesfromfilepath-optionstring-unit","title":"loadPropertiesFromFile(path: Option[String]): Unit","text":"

                                                                                                                                  loadPropertiesFromFile tries to open the input path file (if defined) or the default metrics configuration file metrics.properties (on CLASSPATH).

                                                                                                                                  If either file is available, loadPropertiesFromFile loads the properties (to <> registry).

                                                                                                                                  In case of exceptions, you should see the following ERROR message in the logs followed by the exception.

                                                                                                                                  ERROR Error loading configuration file [file]\n

                                                                                                                                  NOTE: loadPropertiesFromFile is used exclusively when MetricsConfig <>.
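A minimal sketch of what the loading could look like with plain java.io and classloader APIs (an approximation, not Spark's exact code; the error handling simply mirrors the ERROR message above):

```scala
import java.io.{FileInputStream, InputStream}
import java.util.Properties

def loadPropertiesFromFile(path: Option[String], into: Properties): Unit = {
  try {
    // the given file if defined, otherwise the metrics.properties resource on the CLASSPATH
    val is: Option[InputStream] = path match {
      case Some(file) => Some(new FileInputStream(file))
      case None       => Option(getClass.getClassLoader.getResourceAsStream("metrics.properties"))
    }
    is.foreach { in =>
      try into.load(in)
      finally in.close()
    }
  } catch {
    case e: Exception =>
      Console.err.println(s"Error loading configuration file ${path.getOrElse("metrics.properties")}")
  }
}
```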

                                                                                                                                  === [[subProperties]] Grouping Properties Per Subsystem -- subProperties Method

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala_3","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#subpropertiesprop-properties-regex-regex-mutablehashmapstring-properties","title":"subProperties(prop: Properties, regex: Regex): mutable.HashMap[String, Properties]","text":"

subProperties takes the prop properties and groups the keys that match the given regex. For every matching key, subProperties uses the matching prefix as a new key, with the matching suffix(es) and their value(s) collected under it.

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala_4","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#driverhelloworld-driver-helloworld","title":"driver.hello.world => (driver, (hello.world))","text":"

                                                                                                                                  NOTE: subProperties is used when MetricsConfig <> (to apply the default metrics configuration) and when MetricsSystem registers metrics sources and sinks.
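The following sketch reproduces the driver.hello.world => (driver, hello.world) example above with a Scala Regex; it is an illustration of the grouping, not Spark's exact implementation:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import scala.collection.mutable
import scala.util.matching.Regex

def subProperties(prop: Properties, regex: Regex): mutable.HashMap[String, Properties] = {
  val result = mutable.HashMap[String, Properties]()
  prop.stringPropertyNames.asScala.foreach { key =>
    regex.findPrefixMatchOf(key).foreach { m =>
      val instance = m.group(1)   // e.g. "driver"
      val suffix   = m.group(2)   // e.g. "hello.world"
      result.getOrElseUpdate(instance, new Properties()).setProperty(suffix, prop.getProperty(key))
    }
  }
  result
}

val props = new Properties()
props.setProperty("driver.hello.world", "42")
val grouped = subProperties(props, "^([^.]+)\\.(.+)".r)
// grouped("driver").getProperty("hello.world") == "42"
```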

                                                                                                                                  === [[getInstance]] getInstance Method

                                                                                                                                  "},{"location":"metrics/MetricsConfig/#source-scala_5","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#getinstanceinst-string-properties","title":"getInstance(inst: String): Properties","text":"

                                                                                                                                  getInstance...FIXME

                                                                                                                                  NOTE: getInstance is used when...FIXME

                                                                                                                                  "},{"location":"metrics/MetricsServlet/","title":"MetricsServlet JSON Metrics Sink","text":"

                                                                                                                                  MetricsServlet is a metrics sink that gives metrics snapshots in JSON format.

                                                                                                                                  MetricsServlet is a \"special\" sink as it is only available to the metrics instances with a web UI:

                                                                                                                                  • Driver of a Spark application
                                                                                                                                  • Spark Standalone's Master and Worker

You can access the metrics from MetricsServlet at the /metrics/json URI by default. The full URL depends on the metrics instance, e.g. http://localhost:4040/metrics/json/ for a running Spark application.

                                                                                                                                  $ http http://localhost:4040/metrics/json/\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 5005\nContent-Type: text/json;charset=utf-8\nDate: Mon, 11 Jun 2018 06:29:03 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nX-Content-Type-Options: nosniff\nX-Frame-Options: SAMEORIGIN\nX-XSS-Protection: 1; mode=block\n\n{\n    \"counters\": {\n        \"local-1528698499919.driver.HiveExternalCatalog.fileCacheHits\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.filesDiscovered\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.hiveClientCalls\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.parallelListingJobCount\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.partitionsFetched\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.LiveListenerBus.numEventsPosted\": {\n            \"count\": 7\n        },\n        \"local-1528698499919.driver.LiveListenerBus.queue.appStatus.numDroppedEvents\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.LiveListenerBus.queue.executorManagement.numDroppedEvents\": {\n            \"count\": 0\n        }\n    },\n    ...\n

                                                                                                                                  MetricsServlet is <> exclusively when MetricsSystem is started (and requested to register metrics sinks).

MetricsServlet can be configured using configuration properties with the sink.servlet prefix (in spark-metrics-MetricsConfig.md[metrics configuration]). Explicit configuration is not required, though, since MetricsConfig spark-metrics-MetricsConfig.md#setDefaultProperties[makes sure] that MetricsServlet is always configured.
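For illustration, the same sink.servlet-prefixed entries can also be supplied as spark.metrics.conf.*-prefixed Spark properties (the values below are the documented defaults, so setting them explicitly is optional):

```scala
import org.apache.spark.SparkConf

// sink.servlet-prefixed entries supplied via spark.metrics.conf.*-prefixed Spark properties
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.servlet.class", "org.apache.spark.metrics.sink.MetricsServlet")
  .set("spark.metrics.conf.*.sink.servlet.path", "/metrics/json")
```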

MetricsServlet uses https://fasterxml.github.io/jackson-databind/[jackson-databind], the general data-binding package for Jackson (as <>), together with the Dropwizard Metrics library (i.e. it registers a Coda Hale MetricsModule).

                                                                                                                                  [[properties]] .MetricsServlet's Configuration Properties [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Name | Default | Description

                                                                                                                                  | path | /metrics/json/ | [[path]] Path URI prefix to bind to

| sample | false | [[sample]] Whether to show the entire set of samples for histograms |===

                                                                                                                                  [[internal-registries]] .MetricsServlet's Internal Properties (e.g. Registries, Counters and Flags) [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Name | Description

| mapper | [[mapper]] Jackson's https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html[com.fasterxml.jackson.databind.ObjectMapper] that \"provides functionality for reading and writing JSON, either to and from basic POJOs (Plain Old Java Objects), or to and from a general-purpose JSON Tree Model (JsonNode), as well as related functionality for performing conversions.\"

                                                                                                                                  When created, mapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule.

                                                                                                                                  Used exclusively when MetricsServlet is requested to <>.

                                                                                                                                  | servletPath | [[servletPath]] Value of <> configuration property

                                                                                                                                  | servletShowSample | [[servletShowSample]] Flag to control whether to show samples (true) or not (false).

                                                                                                                                  servletShowSample is the value of <> configuration property (if defined) or false.

                                                                                                                                  Used when <> is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule. |==="},{"location":"metrics/MetricsServlet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                  MetricsServlet takes the following when created:

                                                                                                                                  • [[property]] Configuration Properties (as Java Properties)
• [[registry]] MetricRegistry (Dropwizard Metrics)
                                                                                                                                  • [[securityMgr]] SecurityManager

                                                                                                                                  MetricsServlet initializes the <>.

                                                                                                                                  === [[getMetricsSnapshot]] Requesting Metrics Snapshot -- getMetricsSnapshot Method

                                                                                                                                  "},{"location":"metrics/MetricsServlet/#source-scala","title":"[source, scala]","text":""},{"location":"metrics/MetricsServlet/#getmetricssnapshotrequest-httpservletrequest-string","title":"getMetricsSnapshot(request: HttpServletRequest): String","text":"

                                                                                                                                  getMetricsSnapshot simply requests the <> to serialize the <> to a JSON string (using ++https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html#writeValueAsString-java.lang.Object-++[ObjectMapper.writeValueAsString]).

                                                                                                                                  NOTE: getMetricsSnapshot is used exclusively when MetricsServlet is requested to <>.
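A minimal standalone sketch of the same serialization (the TimeUnit arguments and the counter name are assumptions for the example):

```scala
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.json.MetricsModule
import com.fasterxml.jackson.databind.ObjectMapper

// a Dropwizard MetricRegistry with a single counter, for illustration
val registry = new MetricRegistry()
registry.counter("events").inc(7)

// register the Coda Hale MetricsModule, as the mapper does when MetricsServlet is created
val showSamples = false  // corresponds to the servletShowSample flag
val mapper = new ObjectMapper()
  .registerModule(new MetricsModule(TimeUnit.SECONDS, TimeUnit.MILLISECONDS, showSamples))

// serialize the registry to a JSON string, as getMetricsSnapshot does
val json: String = mapper.writeValueAsString(registry)
println(json)  // e.g. {"counters":{"events":{"count":7}}, ...}
```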

                                                                                                                                  === [[getHandlers]] Requesting JSON Servlet Handler -- getHandlers Method

                                                                                                                                  "},{"location":"metrics/MetricsServlet/#source-scala_1","title":"[source, scala]","text":""},{"location":"metrics/MetricsServlet/#gethandlersconf-sparkconf-arrayservletcontexthandler","title":"getHandlers(conf: SparkConf): Array[ServletContextHandler]","text":"

                                                                                                                                  getHandlers returns just a single ServletContextHandler (in a collection) that gives <> in JSON format at every request at <> URI path.

                                                                                                                                  NOTE: getHandlers is used exclusively when MetricsSystem is requested for MetricsSystem.md#getServletHandlers[metrics ServletContextHandlers].

                                                                                                                                  "},{"location":"metrics/MetricsSystem/","title":"MetricsSystem","text":"

                                                                                                                                  MetricsSystem is a registry of metrics sources and sinks of a Spark subsystem.

                                                                                                                                  "},{"location":"metrics/MetricsSystem/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                  MetricsSystem takes the following to be created:

                                                                                                                                  • Instance Name
                                                                                                                                  • SparkConf
                                                                                                                                  • SecurityManager

                                                                                                                                    While being created, MetricsSystem requests the MetricsConfig to initialize.

MetricsSystem is created (using the createMetricsSystem utility) for the metrics instances (Spark subsystems).

                                                                                                                                    "},{"location":"metrics/MetricsSystem/#prometheusservlet","title":"PrometheusServlet

                                                                                                                                    MetricsSystem creates a PrometheusServlet when requested to registerSinks for an instance with sink.prometheusServlet configuration.

                                                                                                                                    MetricsSystem requests the PrometheusServlet for URL handlers when requested for servlet handlers (so it can be attached to a web UI and serve HTTP requests).

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#metricsservlet","title":"MetricsServlet

                                                                                                                                    Note

                                                                                                                                    review me

MetricsServlet is a JSON metrics sink that is only available for the <> with a web UI (i.e. the driver of a Spark application and Spark Standalone's Master).

                                                                                                                                    MetricsSystem may have at most one MetricsServlet JSON metrics sink (which is registered by default).

                                                                                                                                    Initialized when MetricsSystem registers <> (and finds a configuration entry with servlet sink name).

                                                                                                                                    Used when MetricsSystem is requested for a <>.","text":""},{"location":"metrics/MetricsSystem/#creating-metricssystem","title":"Creating MetricsSystem

createMetricsSystem(\n  instance: String,\n  conf: SparkConf,\n  securityMgr: SecurityManager): MetricsSystem\n

                                                                                                                                    createMetricsSystem creates a new MetricsSystem (for the given parameters).

                                                                                                                                    createMetricsSystem is used to create metrics systems.

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#metrics-sources-for-spark-sql","title":"Metrics Sources for Spark SQL
                                                                                                                                    • CodegenMetrics
                                                                                                                                    • HiveCatalogMetrics
                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-source","title":"Registering Metrics Source
                                                                                                                                    registerSource(\n  source: Source): Unit\n

                                                                                                                                    registerSource adds source to the sources internal registry.

                                                                                                                                    registerSource creates an identifier for the metrics source and registers it with the MetricRegistry.

                                                                                                                                    registerSource registers the metrics source under a given name.

                                                                                                                                    registerSource prints out the following INFO message to the logs when registering a name more than once:

                                                                                                                                    Metrics already registered\n
                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#building-metrics-source-identifier","title":"Building Metrics Source Identifier
                                                                                                                                    buildRegistryName(\n  source: Source): String\n

                                                                                                                                    buildRegistryName uses spark-metrics-properties.md#spark.metrics.namespace[spark.metrics.namespace] and executor:Executor.md#spark.executor.id[spark.executor.id] Spark properties to differentiate between a Spark application's driver and executors, and the other Spark framework's components.

(only when <> is driver or executor) buildRegistryName builds a metrics source name that is made up of spark-metrics-properties.md#spark.metrics.namespace[spark.metrics.namespace], executor:Executor.md#spark.executor.id[spark.executor.id] and the name of the source.

                                                                                                                                    FIXME Finish for the other components.

                                                                                                                                    buildRegistryName is used when MetricsSystem is requested to register or remove a metrics source.
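A rough sketch (an approximation, not Spark's exact code) of how the identifier could be composed for the driver or an executor using MetricRegistry.name:

```scala
import com.codahale.metrics.MetricRegistry

def buildRegistryName(
    instance: String,                  // e.g. "driver" or "executor"
    metricsNamespace: Option[String],  // spark.metrics.namespace (defaults to the application ID)
    executorId: Option[String],        // spark.executor.id
    sourceName: String): String = {
  if (instance == "driver" || instance == "executor") {
    (metricsNamespace, executorId) match {
      case (Some(ns), Some(id)) => MetricRegistry.name(ns, id, sourceName)
      case _                    => MetricRegistry.name(sourceName)  // fall back to the bare source name
    }
  } else {
    MetricRegistry.name(sourceName)
  }
}

// e.g. buildRegistryName("driver", Some("app-20240217"), Some("driver"), "DAGScheduler")
//      => "app-20240217.driver.DAGScheduler"
```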

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-sources-for-spark-instance","title":"Registering Metrics Sources for Spark Instance
                                                                                                                                    registerSources(): Unit\n

                                                                                                                                    registerSources finds <> configuration for the <>.

                                                                                                                                    NOTE: instance is defined when MetricsSystem <>.

registerSources finds the configuration of all the spark-metrics-Source.md[metrics sources] for the subsystem (as described with the source. prefix).

For every metrics source, registerSources finds the class property, creates an instance, and in the end <>.

                                                                                                                                    When registerSources fails, you should see the following ERROR message in the logs followed by the exception.

                                                                                                                                    Source class [classPath] cannot be instantiated\n

                                                                                                                                    registerSources is used when MetricsSystem is requested to start.
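A minimal sketch of the class-to-instance step above, driven by an illustrative source.-prefixed configuration entry (org.apache.spark.metrics.source.JvmSource is a real built-in source; the reflection code is an approximation, not Spark's exact implementation):

```scala
import java.util.Properties

// the "class" property of a metrics source configuration names the class to instantiate
val sourceConfig = new Properties()
sourceConfig.setProperty("class", "org.apache.spark.metrics.source.JvmSource")

val classPath = sourceConfig.getProperty("class")
try {
  // inside Spark the instance is treated as an org.apache.spark.metrics.source.Source
  // and handed over to registerSource
  val source = Class.forName(classPath).getConstructor().newInstance()
  println(s"Instantiated $classPath: $source")
} catch {
  case e: Exception =>
    Console.err.println(s"Source class $classPath cannot be instantiated")
}
```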

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#servlet-handlers","title":"Servlet Handlers
                                                                                                                                    getServletHandlers: Array[ServletContextHandler]\n

                                                                                                                                    getServletHandlers requests the metricsServlet (if defined) and the prometheusServlet (if defined) for URL handlers.

                                                                                                                                    getServletHandlers requires that the MetricsSystem is running or throws an IllegalArgumentException:

                                                                                                                                    Can only call getServletHandlers on a running MetricsSystem\n

                                                                                                                                    getServletHandlers is used when:

                                                                                                                                    • SparkContext is created (and attaches the URL handlers to the web UI)
                                                                                                                                    • Master (Spark Standalone) is requested to onStart
                                                                                                                                    • Worker (Spark Standalone) is requested to onStart
                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-sinks","title":"Registering Metrics Sinks
                                                                                                                                    registerSinks(): Unit\n

                                                                                                                                    registerSinks requests the <> for the spark-metrics-MetricsConfig.md#getInstance[configuration] of the <>.

                                                                                                                                    registerSinks requests the <> for the spark-metrics-MetricsConfig.md#subProperties[configuration] of all metrics sinks (i.e. configuration entries that match ^sink\\\\.(.+)\\\\.(.+) regular expression).

For every metrics sink configuration, registerSinks takes the class property and (if defined) creates an instance of the metrics sink using a constructor that takes the configuration, <> and <>.

For the servlet metrics sink, registerSinks converts the sink to a spark-metrics-MetricsServlet.md[MetricsServlet] and sets the <> internal registry.

                                                                                                                                    For all other metrics sinks, registerSinks adds the sink to the <> internal registry.

                                                                                                                                    In case of an Exception, registerSinks prints out the following ERROR message to the logs:

                                                                                                                                    Sink class [classPath] cannot be instantiated\n

                                                                                                                                    registerSinks is used when MetricsSystem is requested to start.
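A minimal sketch of the sink-configuration matching step (the sink class names are real; the grouping code is an approximation of what registerSinks does with the regular expression above):

```scala
import java.util.Properties
import scala.collection.JavaConverters._

// entries whose keys match ^sink\.(.+)\.(.+) are grouped per sink name
val SinkRegex = "^sink\\.(.+)\\.(.+)".r

val instanceConfig = new Properties()
instanceConfig.setProperty("sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
instanceConfig.setProperty("sink.console.period", "10")
instanceConfig.setProperty("sink.servlet.class", "org.apache.spark.metrics.sink.MetricsServlet")

val sinkConfigs: Map[String, Map[String, String]] =
  instanceConfig.stringPropertyNames.asScala.toSeq
    .flatMap { key =>
      SinkRegex.findFirstMatchIn(key).map(m => (m.group(1), m.group(2), instanceConfig.getProperty(key)))
    }
    .groupBy(_._1)
    .map { case (sink, entries) => sink -> entries.map(e => e._2 -> e._3).toMap }

// the "class" property of each group names the sink class to instantiate reflectively
// (with a constructor taking the configuration, the MetricRegistry and the SecurityManager)
val sinkClasses = sinkConfigs.collect {
  case (sink, props) if props.contains("class") => sink -> props("class")
}
// sinkClasses: Map(console -> ...ConsoleSink, servlet -> ...MetricsServlet)
// the "servlet" entry becomes the MetricsServlet; the others go to the sinks registry
```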

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#stopping","title":"Stopping
                                                                                                                                    stop(): Unit\n

                                                                                                                                    stop...FIXME

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#reporting-metrics","title":"Reporting Metrics
                                                                                                                                    report(): Unit\n

                                                                                                                                    report simply requests the registered metrics sinks to report metrics.

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#starting","title":"Starting
                                                                                                                                    start(): Unit\n

                                                                                                                                    start turns <> flag on.

                                                                                                                                    NOTE: start can only be called once and <> an IllegalArgumentException when called multiple times.

                                                                                                                                    start <> the <> for Spark SQL, i.e. CodegenMetrics and HiveCatalogMetrics.

                                                                                                                                    start then registers the configured metrics <> and <> for the <>.

                                                                                                                                    In the end, start requests the registered <> to spark-metrics-Sink.md#start[start].

                                                                                                                                    [[start-IllegalArgumentException]] start throws an IllegalArgumentException when <> flag is on.

                                                                                                                                    requirement failed: Attempting to start a MetricsSystem that is already running\n
                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#logging","title":"Logging

                                                                                                                                    Enable ALL logging level for org.apache.spark.metrics.MetricsSystem logger to see what happens inside.

                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                    log4j.logger.org.apache.spark.metrics.MetricsSystem=ALL\n

                                                                                                                                    Refer to Logging.

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#internal-registries","title":"Internal Registries","text":""},{"location":"metrics/MetricsSystem/#metricregistry","title":"MetricRegistry

                                                                                                                                    Integration point to Dropwizard Metrics' MetricRegistry

                                                                                                                                    Used when MetricsSystem is requested to:

                                                                                                                                    • Register or remove a metrics source
                                                                                                                                    • Start (that in turn registers metrics sinks)
                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#metricsconfig","title":"MetricsConfig

                                                                                                                                    MetricsConfig

                                                                                                                                    Initialized when MetricsSystem is <>.

                                                                                                                                    Used when MetricsSystem registers <> and <>.","text":""},{"location":"metrics/MetricsSystem/#running-flag","title":"running Flag

                                                                                                                                    Indicates whether MetricsSystem has been started (true) or not (false)

                                                                                                                                    Default: false

                                                                                                                                    ","text":""},{"location":"metrics/MetricsSystem/#sinks","title":"sinks

                                                                                                                                    Metrics sinks

                                                                                                                                    Used when MetricsSystem <> and <>.","text":""},{"location":"metrics/MetricsSystem/#sources","title":"sources

                                                                                                                                    Metrics sources

                                                                                                                                    Used when MetricsSystem <>.","text":""},{"location":"metrics/PrometheusServlet/","title":"PrometheusServlet","text":"

                                                                                                                                    PrometheusServlet is a metrics sink that comes with a ServletContextHandler to serve metrics snapshots in Prometheus format.

                                                                                                                                    "},{"location":"metrics/PrometheusServlet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                    PrometheusServlet takes the following to be created:

                                                                                                                                    • Properties
                                                                                                                                    • MetricRegistry (Dropwizard Metrics)

                                                                                                                                      PrometheusServlet is created when:

                                                                                                                                      • MetricsSystem is requested to register metric sinks (with sink.prometheusServlet configuration)
                                                                                                                                      "},{"location":"metrics/PrometheusServlet/#servletcontexthandler","title":"ServletContextHandler

                                                                                                                                      PrometheusServlet creates a ServletContextHandler to be registered at the path configured by path property.

                                                                                                                                      The ServletContextHandler handles text/plain content type.

                                                                                                                                      When executed, the ServletContextHandler gives a metrics snapshot.

                                                                                                                                      ","text":""},{"location":"metrics/PrometheusServlet/#metrics-snapshot","title":"Metrics Snapshot
                                                                                                                                      getMetricsSnapshot(\n  request: HttpServletRequest): String\n

                                                                                                                                      getMetricsSnapshot...FIXME

                                                                                                                                      ","text":""},{"location":"metrics/PrometheusServlet/#gethandlers","title":"getHandlers
                                                                                                                                      getHandlers(\n  conf: SparkConf): Array[ServletContextHandler]\n

getHandlers returns the ServletContextHandler.

                                                                                                                                      getHandlers is used when:

                                                                                                                                      • MetricsSystem is requested for servlet handlers
                                                                                                                                      ","text":""},{"location":"metrics/Sink/","title":"Sink","text":"

                                                                                                                                      Sink is a <> of metrics sinks.

                                                                                                                                      [[contract]] [source, scala]

                                                                                                                                      package org.apache.spark.metrics.sink

trait Sink {
  def start(): Unit
  def stop(): Unit
  def report(): Unit
}

                                                                                                                                      NOTE: Sink is a private[spark] contract.

                                                                                                                                      .Sink Contract [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                      | start | [[start]] Used when...FIXME

                                                                                                                                      | stop | [[stop]] Used when...FIXME

                                                                                                                                      | report | [[report]] Used when...FIXME |===

                                                                                                                                      [[implementations]] .Sinks [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Sink | Description

                                                                                                                                      | ConsoleSink | [[ConsoleSink]]

                                                                                                                                      | CsvSink | [[CsvSink]]

                                                                                                                                      | GraphiteSink | [[GraphiteSink]]

                                                                                                                                      | JmxSink | [[JmxSink]]

                                                                                                                                      | spark-metrics-MetricsServlet.md[MetricsServlet] | [[MetricsServlet]]

                                                                                                                                      | Slf4jSink | [[Slf4jSink]]

                                                                                                                                      | StatsdSink | [[StatsdSink]] |===

                                                                                                                                      NOTE: All known <> in Spark 2.3 are in org.apache.spark.metrics.sink Scala package."},{"location":"metrics/Source/","title":"Source","text":"

                                                                                                                                      Source is an abstraction of metrics sources.

                                                                                                                                      "},{"location":"metrics/Source/#contract","title":"Contract","text":""},{"location":"metrics/Source/#metricregistry","title":"MetricRegistry
                                                                                                                                      metricRegistry: MetricRegistry\n

                                                                                                                                      MetricRegistry (Codahale Metrics)

                                                                                                                                      Used when:

                                                                                                                                      • MetricsSystem is requested to register a metrics source
                                                                                                                                      ","text":""},{"location":"metrics/Source/#source-name","title":"Source Name
                                                                                                                                      sourceName: String\n

                                                                                                                                      Used when:

                                                                                                                                      • MetricsSystem is requested to build a metrics source identifier and getSourcesByName
                                                                                                                                      ","text":""},{"location":"metrics/Source/#implementations","title":"Implementations","text":"
                                                                                                                                      • AccumulatorSource
                                                                                                                                      • AppStatusSource
                                                                                                                                      • BlockManagerSource
                                                                                                                                      • DAGSchedulerSource
                                                                                                                                      • ExecutorAllocationManagerSource
                                                                                                                                      • ExecutorMetricsSource
                                                                                                                                      • ExecutorSource
                                                                                                                                      • JvmSource
                                                                                                                                      • ShuffleMetricsSource
                                                                                                                                      • others
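For illustration, a minimal custom metrics source implementing the contract above could look as follows (the Source trait is private[spark], so this sketch assumes the class is declared in an org.apache.spark sub-package, a common workaround for custom sources; the gauge is illustrative only):

```scala
package org.apache.spark.metrics.source

import com.codahale.metrics.{Gauge, MetricRegistry}

class HelloWorldSource extends Source {
  override val sourceName: String = "helloWorld"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // a simple gauge, registered with this source's MetricRegistry
  metricRegistry.register(MetricRegistry.name("currentTimeMillis"), new Gauge[Long] {
    override def getValue: Long = System.currentTimeMillis()
  })
}
```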
                                                                                                                                      "},{"location":"metrics/configuration-properties/","title":"Configuration Properties","text":""},{"location":"metrics/configuration-properties/#sparkmetricsappstatussourceenabled","title":"spark.metrics.appStatusSource.enabled

                                                                                                                                      Enables Dropwizard/Codahale metrics with the status of a live Spark application

                                                                                                                                      Default: false

                                                                                                                                      Used when:

                                                                                                                                      • AppStatusSource utility is used to create an AppStatusSource
                                                                                                                                      ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsconf","title":"spark.metrics.conf

                                                                                                                                      The metrics configuration file

                                                                                                                                      Default: metrics.properties

                                                                                                                                      ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsexecutormetricssourceenabled","title":"spark.metrics.executorMetricsSource.enabled

                                                                                                                                      Enables registering ExecutorMetricsSource with the metrics system

                                                                                                                                      Default: true

                                                                                                                                      Used when:

                                                                                                                                      • SparkContext is created
                                                                                                                                      • Executor is created
                                                                                                                                      ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsnamespace","title":"spark.metrics.namespace

                                                                                                                                      Root namespace for metrics reporting

                                                                                                                                      Default: Spark Application ID (i.e. spark.app.id configuration property)

Since a Spark application's ID changes with every execution, a custom namespace can be specified for easier metrics reporting.

                                                                                                                                      Used when MetricsSystem is requested for a metrics source identifier (metrics namespace)

                                                                                                                                      ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsstaticsourcesenabled","title":"spark.metrics.staticSources.enabled

                                                                                                                                      Enables static metric sources

                                                                                                                                      Default: true

                                                                                                                                      Used when:

                                                                                                                                      • SparkContext is created
                                                                                                                                      • SparkEnv utility is used to create SparkEnv for executors
                                                                                                                                      ","text":""},{"location":"network/","title":"Network","text":""},{"location":"network/SparkTransportConf/","title":"SparkTransportConf Utility","text":""},{"location":"network/SparkTransportConf/#fromsparkconf","title":"fromSparkConf
                                                                                                                                      fromSparkConf(\n  _conf: SparkConf,\n  module: String, // (1)\n  numUsableCores: Int = 0,\n  role: Option[String] = None): TransportConf // (2)\n
                                                                                                                                      1. The given module is shuffle most of the time except:
                                                                                                                                        • rpc for NettyRpcEnv
                                                                                                                                        • files for NettyRpcEnv
                                                                                                                                      2. Only defined in NettyRpcEnv to be either driver or executor

fromSparkConf makes a copy of (clones) the given SparkConf.

                                                                                                                                      fromSparkConf sets the following configuration properties (for the given module):

                                                                                                                                      • spark.[module].io.serverThreads
                                                                                                                                      • spark.[module].io.clientThreads

The values are taken from the following properties, in order, until one is found (with suffix being serverThreads or clientThreads, respectively):

                                                                                                                                      1. spark.[role].[module].io.[suffix]
                                                                                                                                      2. spark.[module].io.[suffix]

If neither is found, fromSparkConf falls back to the default number of threads (based on the given numUsableCores and capped at 8).

                                                                                                                                      In the end, fromSparkConf creates a TransportConf (for the given module and the updated SparkConf).
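A rough sketch of the lookup order and the default-thread cap described above (the helper is hypothetical; only the property-name pattern and the cap of 8 come from the text):

```scala
import org.apache.spark.SparkConf

def numThreads(
    conf: SparkConf,
    module: String,        // e.g. "shuffle", "rpc", "files"
    role: Option[String],  // e.g. Some("driver") or Some("executor")
    suffix: String,        // "serverThreads" or "clientThreads"
    numUsableCores: Int): Int = {
  // spark.[role].[module].io.[suffix] first, then spark.[module].io.[suffix]
  val candidates = role.map(r => s"spark.$r.$module.io.$suffix").toSeq :+ s"spark.$module.io.$suffix"
  val configured = candidates.flatMap(key => conf.getOption(key)).headOption.map(_.toInt)
  // default: the usable cores (or all available processors), capped at 8
  val cores = if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
  configured.getOrElse(math.min(cores, 8))
}
```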

                                                                                                                                      fromSparkConf\u00a0is used when:

                                                                                                                                      • SparkEnv utility is used to create a SparkEnv (with the spark.shuffle.service.enabled configuration property enabled)
                                                                                                                                      • ExternalShuffleService is created
                                                                                                                                      • NettyBlockTransferService is requested to init
                                                                                                                                      • NettyRpcEnv is created and requested for a downloadClient
                                                                                                                                      • IndexShuffleBlockResolver is created
                                                                                                                                      • ShuffleBlockPusher is requested to initiateBlockPush
                                                                                                                                      • BlockManager is requested to readDiskBlockFromSameHostExecutor
                                                                                                                                      ","text":""},{"location":"network/TransportClientFactory/","title":"TransportClientFactory","text":""},{"location":"network/TransportClientFactory/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                      TransportClientFactory takes the following to be created:

                                                                                                                                      • TransportContext
                                                                                                                                      • TransportClientBootstraps

TransportClientFactory is created when:

                                                                                                                                        • TransportContext is requested for a TransportClientFactory
                                                                                                                                        "},{"location":"network/TransportClientFactory/#configuration-properties","title":"Configuration Properties","text":"

                                                                                                                                        While being created, TransportClientFactory requests the given TransportContext for the TransportConf that is used to access the values of the following (configuration) properties:

                                                                                                                                        • io.numConnectionsPerPeer
                                                                                                                                        • io.mode
                                                                                                                                        • io.preferDirectBufs
                                                                                                                                        • io.retryWait
                                                                                                                                        • spark.network.sharedByteBufAllocators.enabled
                                                                                                                                        • spark.network.io.preferDirectBufs
                                                                                                                                        • Module Name
                                                                                                                                        "},{"location":"network/TransportClientFactory/#creating-transportclient","title":"Creating TransportClient
                                                                                                                                        TransportClient createClient(\n  String remoteHost,\n  int remotePort) // (1)\nTransportClient createClient(\n  String remoteHost,\n  int remotePort,\n  boolean fastFail)\nTransportClient createClient(\n  InetSocketAddress address)\n
                                                                                                                                        1. Turns fastFail off

                                                                                                                                        createClient prints out the following DEBUG message to the logs:

Creating new connection to [address]

                                                                                                                                        createClient creates a Netty Bootstrap and initializes it.

                                                                                                                                        createClient requests the Netty Bootstrap to connect.

                                                                                                                                        If successful, createClient prints out the following DEBUG message and requests the TransportClientBootstraps to doBootstrap.

Connection to [address] successful, running bootstraps...

                                                                                                                                        In the end, createClient prints out the following INFO message:

Successfully created connection to [address] after [t] ms ([t] ms spent in bootstraps)
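
A hedged usage sketch follows; the TransportContext instance, the host name, and the port are assumptions for illustration:

import org.apache.spark.network.TransportContext
import org.apache.spark.network.client.{TransportClient, TransportClientFactory}

// Assumes a TransportContext created elsewhere (e.g. by NettyBlockTransferService at init).
def openConnection(transportContext: TransportContext): TransportClient = {
  val clientFactory: TransportClientFactory = transportContext.createClientFactory()
  // Creates (or reuses) a connection to the remote host; may throw an IOException.
  clientFactory.createClient("shuffle-host.example.com", 7337)
}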
                                                                                                                                        ","text":""},{"location":"network/TransportConf/","title":"TransportConf","text":""},{"location":"network/TransportConf/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                        TransportConf takes the following to be created:

                                                                                                                                        • Module Name
                                                                                                                                        • ConfigProvider

TransportConf is created when:

                                                                                                                                          • SparkTransportConf utility is used to fromSparkConf
                                                                                                                                          • YarnShuffleService (Spark on YARN) is requested to serviceInit
                                                                                                                                          "},{"location":"network/TransportConf/#module-name","title":"Module Name

TransportConf is given the name of the module that the transport-related configuration properties are for. The module name is one of the following (per SparkTransportConf):

                                                                                                                                          • shuffle
                                                                                                                                          • rpc for NettyRpcEnv
                                                                                                                                          • files for NettyRpcEnv
                                                                                                                                          ","text":""},{"location":"network/TransportConf/#getmodulename","title":"getModuleName
                                                                                                                                          String getModuleName()\n

                                                                                                                                          getModuleName returns the module name.

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#getconfkey","title":"getConfKey
                                                                                                                                          String getConfKey(\n  String suffix)\n

                                                                                                                                          getConfKey creates the key of a configuration property (with the module and the given suffix):

spark.[module].[suffix]
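
For illustration, a minimal sketch of the key construction (with shuffle as an example module name):

// Mirrors getConfKey for module = "shuffle" (illustration only).
val module = "shuffle"
def getConfKey(suffix: String): String = s"spark.$module.$suffix"

getConfKey("io.serverThreads")  // spark.shuffle.io.serverThreads
getConfKey("io.mode")           // spark.shuffle.io.mode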
                                                                                                                                          ","text":""},{"location":"network/TransportConf/#suffixes","title":"Suffixes","text":""},{"location":"network/TransportConf/#iomode","title":"io.mode
                                                                                                                                          • nio (default)
                                                                                                                                          • epoll
                                                                                                                                          ","text":""},{"location":"network/TransportConf/#iopreferdirectbufs","title":"io.preferDirectBufs

                                                                                                                                          Controls whether Spark prefers allocating off-heap byte buffers within Netty (true) or not (false).

                                                                                                                                          Default: true

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#ioconnectiontimeout","title":"io.connectionTimeout","text":""},{"location":"network/TransportConf/#ioconnectioncreationtimeout","title":"io.connectionCreationTimeout","text":""},{"location":"network/TransportConf/#iobacklog","title":"io.backLog

                                                                                                                                          The requested maximum length of the queue of incoming connections

                                                                                                                                          Default: -1 (no backlog)

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#ionumconnectionsperpeer","title":"io.numConnectionsPerPeer

                                                                                                                                          Default: 1

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#ioserverthreads","title":"io.serverThreads","text":""},{"location":"network/TransportConf/#ioclientthreads","title":"io.clientThreads

                                                                                                                                          Default: 0

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#ioreceivebuffer","title":"io.receiveBuffer","text":""},{"location":"network/TransportConf/#iosendbuffer","title":"io.sendBuffer","text":""},{"location":"network/TransportConf/#sasltimeout","title":"sasl.timeout","text":""},{"location":"network/TransportConf/#iomaxretries","title":"io.maxRetries","text":""},{"location":"network/TransportConf/#ioretrywait","title":"io.retryWait

Time to wait before retrying after an IOException. Only relevant if io.maxRetries is greater than 0.

                                                                                                                                          Default: 5s

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#iolazyfd","title":"io.lazyFD","text":""},{"location":"network/TransportConf/#ioenableverbosemetrics","title":"io.enableVerboseMetrics

Enables Netty's detailed memory metrics

                                                                                                                                          Default: false

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#ioenabletcpkeepalive","title":"io.enableTcpKeepAlive","text":""},{"location":"network/TransportConf/#preferdirectbufsforsharedbytebufallocators","title":"preferDirectBufsForSharedByteBufAllocators

                                                                                                                                          The value of spark.network.io.preferDirectBufs.

                                                                                                                                          ","text":""},{"location":"network/TransportConf/#sharedbytebufallocators","title":"sharedByteBufAllocators

                                                                                                                                          The value of spark.network.sharedByteBufAllocators.enabled.

                                                                                                                                          ","text":""},{"location":"network/TransportContext/","title":"TransportContext","text":""},{"location":"network/TransportContext/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                          TransportContext takes the following to be created:

                                                                                                                                          • TransportConf
                                                                                                                                          • RpcHandler
                                                                                                                                          • closeIdleConnections flag
                                                                                                                                          • isClientOnly flag

TransportContext is created when:

                                                                                                                                            • ExternalBlockStoreClient is requested to init
                                                                                                                                            • ExternalShuffleService is requested to start
                                                                                                                                            • NettyBlockTransferService is requested to init
                                                                                                                                            • NettyRpcEnv is created and requested to downloadClient
                                                                                                                                            • YarnShuffleService (Spark on YARN) is requested to serviceInit
                                                                                                                                            "},{"location":"network/TransportContext/#creating-server","title":"Creating Server
                                                                                                                                            TransportServer createServer(\n  int port,\n  List<TransportServerBootstrap> bootstraps)\nTransportServer createServer(\n  String host,\n  int port,\n  List<TransportServerBootstrap> bootstraps)\n

                                                                                                                                            createServer creates a TransportServer (with the RpcHandler and the input arguments).

createServer is used when:

                                                                                                                                            • YarnShuffleService (Spark on YARN) is requested to serviceInit
                                                                                                                                            • ExternalShuffleService is requested to start
                                                                                                                                            • NettyBlockTransferService is requested to createServer
                                                                                                                                            • NettyRpcEnv is requested to startServer
                                                                                                                                            ","text":""},{"location":"network/TransportContext/#creating-transportclientfactory","title":"Creating TransportClientFactory
                                                                                                                                            TransportClientFactory createClientFactory() // (1)\nTransportClientFactory createClientFactory(\n  List<TransportClientBootstrap> bootstraps)\n
                                                                                                                                            1. Uses empty bootstraps

                                                                                                                                            createClientFactory creates a TransportClientFactory (with itself and the given TransportClientBootstraps).

createClientFactory is used when:

                                                                                                                                            • ExternalBlockStoreClient is requested to init
                                                                                                                                            • NettyBlockTransferService is requested to init
                                                                                                                                            • NettyRpcEnv is created and requested to downloadClient
                                                                                                                                            ","text":""},{"location":"plugins/","title":"Plugin Framework","text":"

                                                                                                                                            Plugin Framework is an API for registering custom extensions (plugins) to be executed on the driver and executors.

Plugin Framework uses separate PluginContainers for the driver and executors, and the spark.plugins configuration property to register SparkPlugins.
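
For example, a minimal do-nothing plugin could look as follows. This is a sketch only; the class name MyMetricsPlugin is made up, while the interfaces come from org.apache.spark.api.plugin:

import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Instantiated once on the driver and once per executor (via spark.plugins).
class MyMetricsPlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Whatever is returned here is passed as extra configuration to executor-side plugins.
      Collections.emptyMap()
    }
  }

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = ()
  }
}

Such a plugin is registered by setting spark.plugins to its fully-qualified class name.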

                                                                                                                                            Plugin Framework was introduced in Spark 2.4.4 (with an API for executors) with further changes in Spark 3.0.0 (to cover the driver).

                                                                                                                                            "},{"location":"plugins/#resources","title":"Resources","text":"
                                                                                                                                            • Advanced Instrumentation in the official documentation of Apache Spark
                                                                                                                                            • Commit for SPARK-29397
                                                                                                                                            • Spark Plugin Framework in 3.0 - Part 1: Introduction by Madhukara Phatak
                                                                                                                                            • Spark Memory Monitor by squito
                                                                                                                                            • SparkPlugins by Luca Canali (CERN)
                                                                                                                                            "},{"location":"plugins/DriverPlugin/","title":"DriverPlugin","text":"

                                                                                                                                            DriverPlugin is...FIXME

                                                                                                                                            "},{"location":"plugins/DriverPluginContainer/","title":"DriverPluginContainer","text":"

                                                                                                                                            DriverPluginContainer is a PluginContainer.

                                                                                                                                            "},{"location":"plugins/DriverPluginContainer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                            DriverPluginContainer takes the following to be created:

                                                                                                                                            • SparkContext
                                                                                                                                            • Resources (Map[String, ResourceInformation])
                                                                                                                                            • SparkPlugins

DriverPluginContainer is created when:

• PluginContainer utility is used to create a PluginContainer (at SparkContext startup)
                                                                                                                                              "},{"location":"plugins/DriverPluginContainer/#registering-metrics","title":"Registering Metrics
                                                                                                                                              registerMetrics(\n  appId: String): Unit\n

registerMetrics is part of the PluginContainer abstraction.

For every driver plugin, registerMetrics requests it to register metrics and then requests the associated PluginContextImpl to do the same.

                                                                                                                                              ","text":""},{"location":"plugins/DriverPluginContainer/#logging","title":"Logging

                                                                                                                                              Enable ALL logging level for org.apache.spark.internal.plugin.DriverPluginContainer logger to see what happens inside.

                                                                                                                                              Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.internal.plugin.DriverPluginContainer=ALL

                                                                                                                                              Refer to Logging.

                                                                                                                                              ","text":""},{"location":"plugins/ExecutorPlugin/","title":"ExecutorPlugin","text":"

                                                                                                                                              ExecutorPlugin is...FIXME

                                                                                                                                              "},{"location":"plugins/ExecutorPluginContainer/","title":"ExecutorPluginContainer","text":"

                                                                                                                                              ExecutorPluginContainer is a PluginContainer for Executors.

                                                                                                                                              "},{"location":"plugins/ExecutorPluginContainer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                              ExecutorPluginContainer takes the following to be created:

                                                                                                                                              • SparkEnv
                                                                                                                                              • Resources (Map[String, ResourceInformation])
                                                                                                                                              • SparkPlugins

                                                                                                                                                ExecutorPluginContainer is created when:

                                                                                                                                                • PluginContainer utility is used to create a PluginContainer (for Executors)
                                                                                                                                                "},{"location":"plugins/ExecutorPluginContainer/#executorplugins","title":"ExecutorPlugins

                                                                                                                                                ExecutorPluginContainer initializes executorPlugins internal registry of ExecutorPlugins when created.

                                                                                                                                                ","text":""},{"location":"plugins/ExecutorPluginContainer/#initialization","title":"Initialization","text":"

executorPlugins finds all the configuration properties with the spark.plugins.internal.conf. prefix (in the SparkConf) to be used as extra configuration for every ExecutorPlugin of the given SparkPlugins.

                                                                                                                                                For every SparkPlugin (in the given SparkPlugins) that defines an ExecutorPlugin, executorPlugins creates a PluginContextImpl, requests the ExecutorPlugin to init (with the PluginContextImpl and the extra configuration) and the PluginContextImpl to registerMetrics.

                                                                                                                                                In the end, executorPlugins prints out the following INFO message to the logs (for every ExecutorPlugin):

Initialized executor component for plugin [name].
                                                                                                                                                "},{"location":"plugins/ExecutorPluginContainer/#logging","title":"Logging

                                                                                                                                                Enable ALL logging level for org.apache.spark.internal.plugin.ExecutorPluginContainer logger to see what happens inside.

                                                                                                                                                Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.internal.plugin.ExecutorPluginContainer=ALL

                                                                                                                                                Refer to Logging.

                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/","title":"PluginContainer","text":"

                                                                                                                                                PluginContainer is an abstraction of plugin containers that can register metrics (for the driver and executors).

                                                                                                                                                PluginContainer is created for the driver and executors using apply utility.

                                                                                                                                                "},{"location":"plugins/PluginContainer/#contract","title":"Contract","text":""},{"location":"plugins/PluginContainer/#listening-to-task-failures","title":"Listening to Task Failures
                                                                                                                                                onTaskFailed(\n  failureReason: TaskFailedReason): Unit\n

                                                                                                                                                For ExecutorPluginContainer only

                                                                                                                                                Possible TaskFailedReasons:

                                                                                                                                                • TaskKilledException
                                                                                                                                                • TaskKilled
                                                                                                                                                • FetchFailed
                                                                                                                                                • TaskCommitDenied
                                                                                                                                                • ExceptionFailure

                                                                                                                                                Used when:

                                                                                                                                                • TaskRunner is requested to run (and the task has failed)
                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/#listening-to-task-start","title":"Listening to Task Start
                                                                                                                                                onTaskStart(): Unit\n

                                                                                                                                                For ExecutorPluginContainer only

                                                                                                                                                Used when:

                                                                                                                                                • TaskRunner is requested to run (and the task has just started)
                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/#listening-to-task-success","title":"Listening to Task Success
                                                                                                                                                onTaskSucceeded(): Unit\n

                                                                                                                                                For ExecutorPluginContainer only

                                                                                                                                                Used when:

                                                                                                                                                • TaskRunner is requested to run (and the task has finished successfully)
                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/#registering-metrics","title":"Registering Metrics
                                                                                                                                                registerMetrics(\n  appId: String): Unit\n

                                                                                                                                                Registers metrics for the application ID

                                                                                                                                                For DriverPluginContainer only

                                                                                                                                                Used when:

                                                                                                                                                • SparkContext is created
                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/#shutdown","title":"Shutdown
                                                                                                                                                shutdown(): Unit\n

                                                                                                                                                Used when:

                                                                                                                                                • SparkContext is requested to stop
                                                                                                                                                • Executor is requested to stop
                                                                                                                                                ","text":""},{"location":"plugins/PluginContainer/#implementations","title":"Implementations","text":"Sealed Abstract Class

                                                                                                                                                PluginContainer is a Scala sealed abstract class which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                • DriverPluginContainer
                                                                                                                                                • ExecutorPluginContainer
                                                                                                                                                "},{"location":"plugins/PluginContainer/#creating-plugincontainer","title":"Creating PluginContainer
                                                                                                                                                // the driver\napply(\n  sc: SparkContext,\n  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]\n// executors\napply(\n  env: SparkEnv,\n  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]\n// private helper\napply(\n  ctx: Either[SparkContext, SparkEnv],\n  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]\n

                                                                                                                                                apply creates a PluginContainer for the driver or executors (based on the type of the first input argument, i.e. SparkContext or SparkEnv, respectively).

                                                                                                                                                apply first loads the SparkPlugins defined by spark.plugins configuration property.

Only when at least one plugin has been loaded does apply create a DriverPluginContainer or an ExecutorPluginContainer.

                                                                                                                                                apply is used when:

                                                                                                                                                • SparkContext is created
                                                                                                                                                • Executor is created
                                                                                                                                                ","text":""},{"location":"plugins/PluginContextImpl/","title":"PluginContextImpl","text":"

                                                                                                                                                PluginContextImpl is...FIXME

                                                                                                                                                "},{"location":"plugins/SparkPlugin/","title":"SparkPlugin","text":"

                                                                                                                                                SparkPlugin is an abstraction of custom extensions for Spark applications.

                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#driver-side-component","title":"Driver-side Component
                                                                                                                                                DriverPlugin driverPlugin()\n

                                                                                                                                                Used when:

                                                                                                                                                • DriverPluginContainer is created
                                                                                                                                                ","text":"","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#executor-side-component","title":"Executor-side Component
                                                                                                                                                ExecutorPlugin executorPlugin()\n

                                                                                                                                                Used when:

                                                                                                                                                • ExecutorPluginContainer is created
                                                                                                                                                ","text":"","tags":["DeveloperApi"]},{"location":"rdd/","title":"Resilient Distributed Dataset (RDD)","text":"

                                                                                                                                                Resilient Distributed Dataset (aka RDD) is the primary data abstraction in Apache Spark and the core of Spark (that I often refer to as Spark Core).

                                                                                                                                                The origins of RDD

                                                                                                                                                The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

Read the paper and skip the rest of this page. You'll save a great deal of your precious time 😎

                                                                                                                                                An RDD is a description of a fault-tolerant and resilient computation over a distributed collection of records (spread over one or many partitions).

                                                                                                                                                RDDs and Scala Collections

                                                                                                                                                RDDs are like Scala collections, and they only differ by their distribution, i.e. a RDD is computed on many JVMs while a Scala collection lives on a single JVM.

Using RDDs, Spark hides data partitioning and hence distribution, which in turn allowed the designers to build a parallel computation framework with a higher-level programming interface (API) for four mainstream programming languages.

                                                                                                                                                The features of RDDs (decomposing the name):

                                                                                                                                                • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
                                                                                                                                                • Distributed with data residing on multiple nodes in a Spark cluster
                                                                                                                                                • Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).

                                                                                                                                                From the scaladoc of org.apache.spark.rdd.RDD:

                                                                                                                                                A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

                                                                                                                                                From the original paper about RDD - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing:

                                                                                                                                                Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

Besides the above traits (that are directly embedded in the name of the data abstraction - RDD), an RDD has the following additional traits:

                                                                                                                                                • In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as possible.
                                                                                                                                                • Immutable or Read-Only, i.e. it does not change once created and can only be transformed using transformations to new RDDs.
                                                                                                                                                • Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action is executed that triggers the execution.
                                                                                                                                                • Cacheable, i.e. you can hold all the data in a persistent \"storage\" like memory (default and the most preferred) or disk (the least preferred due to access speed).
                                                                                                                                                • Parallel, i.e. process data in parallel.
                                                                                                                                                • Typed -- RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)].
                                                                                                                                                • Partitioned -- records are partitioned (split into logical partitions) and distributed across nodes in a cluster.
• Location-Stickiness -- an RDD can define preferred locations to compute partitions (as close to the records as possible).

                                                                                                                                                  Note

                                                                                                                                                  Preferred location (aka locality preferences or placement preferences or locality info) is information about the locations of RDD records (that Spark's DAGScheduler uses to place computing partitions on to have the tasks as close to the data as possible).

Computing partitions in a RDD is a distributed process by design. To achieve even data distribution as well as leverage data locality (in distributed systems like HDFS or Apache Kafka, in which data is partitioned by default), RDDs are split into a fixed number of partitions - logical chunks (parts) of data. The logical division is for processing only; internally, the data is not divided in any way. Each partition comprises records.

Partitions are the units of parallelism. You can control the number of partitions of a RDD using the RDD.repartition or RDD.coalesce transformations (as shown below). Spark tries to process data as close to where it lives as possible, without wasting time sending data across the network (RDD shuffling), and creates as many partitions as required to follow the storage layout and thus optimize data access. This leads to a one-to-one mapping between (physical) data in distributed data storage (e.g., HDFS or Cassandra) and partitions.
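
For instance, the number of partitions can be inspected and changed as follows (a sketch that assumes a SparkContext available as sc):

// Create an RDD with an explicit number of partitions.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

rdd.getNumPartitions                   // 8
rdd.repartition(16).getNumPartitions   // 16 (always shuffles)
rdd.coalesce(4).getNumPartitions       // 4  (avoids a full shuffle)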

                                                                                                                                                  RDDs support two kinds of operations:

                                                                                                                                                  • transformations - lazy operations that return another RDD.
                                                                                                                                                  • actions - operations that trigger computation and return values.

The motivation to create RDDs was (according to the authors) two types of applications that the computing frameworks of the time handled inefficiently:

                                                                                                                                                  • iterative algorithms in machine learning and graph computations
                                                                                                                                                  • interactive data mining tools as ad-hoc queries on the same dataset

                                                                                                                                                  The goal is to reuse intermediate in-memory results across multiple data-intensive workloads with no need for copying large amounts of data over the network.

Technically, RDDs follow the contract defined by the five main intrinsic properties (see the sketch after this list):

                                                                                                                                                  • Parent RDDs (aka RDD dependencies)
                                                                                                                                                  • An array of partitions that a dataset is divided to
                                                                                                                                                  • A compute function to do a computation on partitions
                                                                                                                                                  • An optional Partitioner that defines how keys are hashed, and the pairs partitioned (for key-value RDDs)
                                                                                                                                                  • Optional preferred locations (aka locality info), i.e. hosts for a partition where the records live or are the closest to read from
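
To make the contract concrete, here is a sketch of a bare-bones custom RDD that only provides partitions and a compute function; the class names SimpleRangeRDD and RangePartition are made up for illustration:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A made-up partition type carrying the sub-range of numbers it covers.
class RangePartition(override val index: Int, val start: Int, val end: Int)
  extends Partition

// An RDD of the numbers 0 until n, split into numSlices partitions.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil = no parent RDDs (no dependencies)

  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // The optional properties (partitioner, getPreferredLocations) keep their defaults.
}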

This RDD abstraction supports an expressive set of operations without having to modify the scheduler for each one.

                                                                                                                                                  An RDD is a named (by name) and uniquely identified (by id) entity in a SparkContext (available as context property).

                                                                                                                                                  RDDs live in one and only one SparkContext that creates a logical boundary.

                                                                                                                                                  Note

                                                                                                                                                  RDDs cannot be shared between SparkContexts.

An RDD can optionally have a friendly name, accessible using name, that can be changed using name =:

scala> val ns = sc.parallelize(0 to 10)
ns: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> ns.id
res0: Int = 2

scala> ns.name
res1: String = null

scala> ns.name = "Friendly name"
ns.name: String = Friendly name

scala> ns.name
res2: String = Friendly name

scala> ns.toDebugString
res3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:24 []

An RDD is a container of instructions on how to materialize big (arrays of) distributed data, and how to split it into partitions so that Spark (using executors) can hold some of them.

In general, data distribution helps execute processing in parallel, so that a task processes a chunk of data that it could eventually keep in memory.

Spark executes jobs in parallel, and RDDs are split into partitions to be processed and written in parallel. Inside a partition, data is processed sequentially.

                                                                                                                                                  Saving partitions results in part-files instead of one single file (unless there is a single partition).
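A quick way to see partitions materialize as part-files in spark-shell (the output path below is just an example):

val nums = sc.parallelize(1 to 100, 4) // 4 partitions\nnums.saveAsTextFile(\"/tmp/nums-demo\")  // writes part-00000 to part-00003 under /tmp/nums-demo\n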

                                                                                                                                                  "},{"location":"rdd/#transformations","title":"Transformations","text":"

A transformation is a lazy operation on an RDD that returns another RDD (e.g., map, flatMap, filter, reduceByKey, join, cogroup, etc.)

                                                                                                                                                  Learn more in Transformations.
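A minimal sketch of chaining transformations in spark-shell; nothing is executed yet since no action has been called:

val lines = sc.parallelize(Seq(\"hello spark\", \"hello world\"))\nval words = lines.flatMap(_.split(\"\\\\s+\"))        // transformation\nval counts = words.map((_, 1)).reduceByKey(_ + _)  // more transformations; still lazy\n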

                                                                                                                                                  "},{"location":"rdd/#actions","title":"Actions","text":"

                                                                                                                                                  An action is an operation that triggers execution of RDD transformations and returns a value (to a Spark driver - the user program).

                                                                                                                                                  Learn more in Actions.
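A few actions in spark-shell (sc is a SparkContext); each call triggers a job and returns a value to the driver:

val ns = sc.parallelize(1 to 100)\nns.count()       // 100\nns.reduce(_ + _) // 5050\nns.take(3)       // Array(1, 2, 3)\n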

                                                                                                                                                  "},{"location":"rdd/#creating-rdds","title":"Creating RDDs","text":""},{"location":"rdd/#parallelize","title":"SparkContext.parallelize","text":"

One way to create an RDD is with the SparkContext.parallelize method. It accepts a collection of elements, as shown below (sc is a SparkContext instance):

                                                                                                                                                  scala> val rdd = sc.parallelize(1 to 1000)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25\n

                                                                                                                                                  You may also want to randomize the sample data:

                                                                                                                                                  scala> val data = Seq.fill(10)(util.Random.nextInt)\ndata: Seq[Int] = List(-964985204, 1662791, -1820544313, -383666422, -111039198, 310967683, 1114081267, 1244509086, 1797452433, 124035586)\n\nscala> val rdd = sc.parallelize(data)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29\n

Given that the reason to use Spark is to process more data than a single laptop could handle, SparkContext.parallelize is mainly used to learn Spark in the Spark shell.

SparkContext.parallelize requires all the data to be available on a single machine - the Spark driver - which eventually hits the limits of your laptop.

                                                                                                                                                  "},{"location":"rdd/#makeRDD","title":"SparkContext.makeRDD","text":"
                                                                                                                                                  scala> sc.makeRDD(0 to 1000)\nres0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:25\n
                                                                                                                                                  "},{"location":"rdd/#textFile","title":"SparkContext.textFile","text":"

                                                                                                                                                  One of the easiest ways to create an RDD is to use SparkContext.textFile to read files.

                                                                                                                                                  You can use the local README.md file (and then flatMap over the lines inside to have an RDD of words):

                                                                                                                                                  scala> val words = sc.textFile(\"README.md\").flatMap(_.split(\"\\\\W+\")).cache\nwords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:24\n

                                                                                                                                                  Note

                                                                                                                                                  You cache it so the computation is not performed every time you work with words.
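For example, you could continue with a classic word count over the cached words RDD (the exact counts depend on the contents of your README.md):

val wordCount = words.map((_, 1)).reduceByKey(_ + _)\nwordCount.take(5) // a sample of (word, count) pairs\n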

                                                                                                                                                  "},{"location":"rdd/#rdds-in-web-ui","title":"RDDs in Web UI","text":"

It is quite informative to look at RDDs in the web UI, which is available at http://localhost:4040 for the Spark shell.

                                                                                                                                                  Execute the following Spark application (type all the lines in spark-shell):

                                                                                                                                                  val ints = sc.parallelize(1 to 100) // (1)!\nints.setName(\"Hundred ints\")        // (2)!\nints.cache                          // (3)!\nints.count                          // (4)!\n
1. Creates an RDD with a hundred numbers (with as many partitions as possible)
2. Sets the name of the RDD
3. Caches the RDD for performance reasons, which also makes it visible in the Storage tab in the web UI
4. Executes an action (and materializes the RDD)

                                                                                                                                                  With the above executed, you should see the following in the Web UI:

                                                                                                                                                  Click the name of the RDD (under RDD Name) and you will get the details of how the RDD is cached.

                                                                                                                                                  Execute the following Spark job and you will see how the number of partitions decreases.

                                                                                                                                                  ints.repartition(2).count\n
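You can also check the number of partitions programmatically; the initial number depends on your local setup (e.g. the number of cores given to the Spark shell):

ints.getNumPartitions                 // e.g. 8 on a local[8] spark-shell\nints.repartition(2).getNumPartitions  // 2\n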

                                                                                                                                                  "},{"location":"rdd/Aggregator/","title":"Aggregator","text":"

Aggregator is a set of aggregation functions used to aggregate data using the PairRDDFunctions.combineByKeyWithClassTag transformation.

                                                                                                                                                  Aggregator[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

Aggregator transforms an RDD[(K, V)] into an RDD[(K, C)] (for a \"combined type\" C) using the functions:

• createCombiner: V => C
• mergeValue: (C, V) => C
• mergeCombiners: (C, C) => C

                                                                                                                                                  Aggregator is used to create a ShuffleDependency and ExternalSorter.
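The same three functions surface in the public PairRDDFunctions.combineByKey API. The per-key average below is an illustrative sketch (not taken from the Aggregator sources):

val createCombiner = (v: Double) => (v, 1)                                                    // V => C\nval mergeValue = (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1)                        // (C, V) => C\nval mergeCombiners = (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2) // (C, C) => C\nval avgByKey = sc.parallelize(Seq((\"a\", 1.0), (\"a\", 3.0), (\"b\", 4.0)))\n  .combineByKey(createCombiner, mergeValue, mergeCombiners)\n  .mapValues { case (sum, count) => sum / count }\navgByKey.collect() // Array((a,2.0), (b,4.0)) -- order may vary\n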

combineValuesByKey Method

                                                                                                                                                  "},{"location":"rdd/Aggregator/#source-scala","title":"[source, scala]","text":"

combineValuesByKey(\n  iter: Iterator[_ <: Product2[K, V]],\n  context: TaskContext): Iterator[(K, C)]\n

combineValuesByKey creates a new ExternalAppendOnlyMap (with the aggregation functions).

combineValuesByKey requests the ExternalAppendOnlyMap to insert all key-value pairs from the given iterator (that is the values of a partition).

combineValuesByKey updates the task metrics.

In the end, combineValuesByKey requests the ExternalAppendOnlyMap for an iterator of \"combined\" pairs.

                                                                                                                                                  combineValuesByKey is used when:

• PairRDDFunctions.combineByKeyWithClassTag transformation is used (with the same Partitioner as the RDD's)

• BlockStoreShuffleReader is requested to read combined records for a reduce task (with the Map-Size Partial Aggregation Flag off)

combineCombinersByKey Method

                                                                                                                                                  "},{"location":"rdd/Aggregator/#source-scala_1","title":"[source, scala]","text":"

combineCombinersByKey(\n  iter: Iterator[_ <: Product2[K, C]],\n  context: TaskContext): Iterator[(K, C)]\n

combineCombinersByKey creates an ExternalAppendOnlyMap (with an identity createCombiner and the mergeCombiners function), requests it to insert all key-combiner pairs from the given iterator, updates the task metrics and, in the end, requests the ExternalAppendOnlyMap for an iterator of \"combined\" pairs.

combineCombinersByKey is used when BlockStoreShuffleReader is requested to read combined records for a reduce task (with the Map-Size Partial Aggregation Flag on).

Updating Task Metrics

                                                                                                                                                  "},{"location":"rdd/Aggregator/#source-scala_2","title":"[source, scala]","text":"

updateMetrics(\n  context: TaskContext,\n  map: ExternalAppendOnlyMap[_, _, _]): Unit\n

                                                                                                                                                  updateMetrics requests the input TaskContext for the TaskMetrics to update the metrics based on the metrics of the input ExternalAppendOnlyMap:

• Increment memory bytes spilled

• Increment disk bytes spilled

• Increment peak execution memory

updateMetrics is used when Aggregator is requested to combineValuesByKey and combineCombinersByKey."},{"location":"rdd/AsyncRDDActions/","title":"AsyncRDDActions","text":"

AsyncRDDActions is a class with asynchronous (non-blocking) variants of RDD actions (countAsync, collectAsync, takeAsync, foreachAsync, foreachPartitionAsync) that return a FutureAction. AsyncRDDActions is available on any RDD through a Scala implicit conversion.

                                                                                                                                                  "},{"location":"rdd/CheckpointRDD/","title":"CheckpointRDD","text":"

CheckpointRDD is an extension of the RDD abstraction for RDDs that recover checkpointed data from storage.

                                                                                                                                                  CheckpointRDD cannot be checkpointed again (and doCheckpoint, checkpoint, and localCheckpoint are simply noops).

getPartitions and compute throw a NotImplementedError and are supposed to be overridden by the implementations.

                                                                                                                                                  "},{"location":"rdd/CheckpointRDD/#implementations","title":"Implementations","text":"
                                                                                                                                                  • LocalCheckpointRDD
                                                                                                                                                  • ReliableCheckpointRDD
                                                                                                                                                  "},{"location":"rdd/CoGroupedRDD/","title":"CoGroupedRDD","text":"

                                                                                                                                                  CoGroupedRDD[K] is an RDD that cogroups the parent RDDs.

                                                                                                                                                  RDD[(K, Array[Iterable[_]])]\n

For each key k in the parent RDDs, the resulting RDD contains a tuple with the lists of values for that key (one Iterable per parent RDD).
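A small spark-shell example (the order of keys in the collected output may vary):

val left = sc.parallelize(Seq((1, \"a\"), (1, \"b\"), (2, \"c\")))\nval right = sc.parallelize(Seq((1, \"x\"), (3, \"y\")))\nleft.cogroup(right).collect()\n// e.g. Array((1,(CompactBuffer(a, b),CompactBuffer(x))), (2,(CompactBuffer(c),CompactBuffer())), (3,(CompactBuffer(),CompactBuffer(y))))\n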

                                                                                                                                                  "},{"location":"rdd/CoGroupedRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                  CoGroupedRDD takes the following to be created:

                                                                                                                                                  • Key-Value RDDs (Seq[RDD[_ <: Product2[K, _]]])
                                                                                                                                                  • Partitioner

                                                                                                                                                    CoGroupedRDD is created\u00a0when:

                                                                                                                                                    • RDD.cogroup operator is used
                                                                                                                                                    "},{"location":"rdd/CoalescedRDD/","title":"CoalescedRDD","text":"

CoalescedRDD is an RDD that coalesces the partitions of a parent RDD into a requested number of partitions. It is created by the RDD.coalesce operator.

                                                                                                                                                    "},{"location":"rdd/Dependency/","title":"Dependency","text":"

                                                                                                                                                    Dependency[T] is an abstraction of dependencies between RDDs.

Any time an RDD transformation (e.g. map, flatMap) is used (and an RDD lineage graph is built), Dependencies are the edges.

                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#rdd","title":"RDD
                                                                                                                                                    rdd: RDD[T]\n

                                                                                                                                                    Used when:

                                                                                                                                                    • DAGScheduler is requested for the shuffle dependencies and ResourceProfiles (of an RDD)
                                                                                                                                                    • RDD is requested to getNarrowAncestors, cleanShuffleDependencies, firstParent, parent, toDebugString, getOutputDeterministicLevel
                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#implementations","title":"Implementations","text":"
                                                                                                                                                    • NarrowDependency
                                                                                                                                                    • ShuffleDependency
                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#demo","title":"Demo","text":"

The dependencies of an RDD are available using the RDD.dependencies method.

                                                                                                                                                    val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)\n
                                                                                                                                                    scala> myRdd.dependencies.foreach(println)\norg.apache.spark.ShuffleDependency@41e38d89\n
                                                                                                                                                    scala> myRdd.dependencies.map(_.rdd).foreach(println)\nMapPartitionsRDD[6] at groupBy at <console>:39\n

                                                                                                                                                    RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.

                                                                                                                                                    scala> println(myRdd.toDebugString)\n(16) ShuffledRDD[7] at groupBy at <console>:39 []\n +-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []\n    |   ParallelCollectionRDD[5] at parallelize at <console>:39 []\n
                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"rdd/HadoopRDD/","title":"HadoopRDD","text":"

HadoopRDD (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.HadoopRDD) is an RDD that provides core functionality for reading data stored in HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI using the older MapReduce API (org.apache.hadoop.mapred, https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/package-summary.html).

HadoopRDD is created as a result of calling the following SparkContext methods:

                                                                                                                                                    • hadoopFile
                                                                                                                                                    • textFile (the most often used in examples!)
                                                                                                                                                    • sequenceFile

                                                                                                                                                    Partitions are of type HadoopPartition.

When a HadoopRDD is computed, i.e. an action is called, you should see the INFO message Input split: in the logs.

                                                                                                                                                    scala> sc.textFile(\"README.md\").count\n...\n15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784\n15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:1784+1784\n...\n

                                                                                                                                                    The following properties are set upon partition execution:

                                                                                                                                                    • mapred.tip.id - task id of this task's attempt
                                                                                                                                                    • mapred.task.id - task attempt's id
                                                                                                                                                    • mapred.task.is.map as true
                                                                                                                                                    • mapred.task.partition - split id
                                                                                                                                                    • mapred.job.id

                                                                                                                                                    Spark settings for HadoopRDD:

• spark.hadoop.cloneConf (default: false) - shouldCloneJobConf - should a Hadoop job configuration JobConf object be cloned before spawning a Hadoop job. Refer to [SPARK-2546] Configuration object thread safety issue (https://issues.apache.org/jira/browse/SPARK-2546). When true, you should see a DEBUG message Cloning Hadoop Configuration.

You can register callbacks on TaskContext (e.g. to be notified when a task completes), as in the sketch below.
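A sketch of registering a completion callback from within a task; the println runs on the executors, so in local mode it shows up in the shell output:

import org.apache.spark.TaskContext\nsc.textFile(\"README.md\").mapPartitions { it =>\n  TaskContext.get.addTaskCompletionListener[Unit] { _ =>\n    println(\"partition done\") // runs when the task completes\n  }\n  it\n}.count()\n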

                                                                                                                                                    HadoopRDDs are not checkpointed. They do nothing when checkpoint() is called.

                                                                                                                                                    "},{"location":"rdd/HadoopRDD/#caution","title":"[CAUTION]","text":"

                                                                                                                                                    FIXME

                                                                                                                                                    • What are InputMetrics?
                                                                                                                                                    • What is JobConf?
• What are the InputSplits: FileSplit and CombineFileSplit?
• What are InputFormat and Configurable subtypes?
                                                                                                                                                    • What's InputFormat's RecordReader? It creates a key and a value. What are they?

getPreferredLocations Method

                                                                                                                                                    CAUTION: FIXME

getPartitions Method

The number of partitions of a HadoopRDD, i.e. the return value of getPartitions, is calculated using InputFormat.getSplits(jobConf, minPartitions), where minPartitions is only a hint of how many partitions one may want at minimum. Being a hint, it does not guarantee that the number of partitions will be exactly the number given.
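For example, the second argument of SparkContext.textFile is this minPartitions hint:

val rdd = sc.textFile(\"README.md\", 10) // ask for at least 10 partitions\nrdd.getNumPartitions                   // 10 or more, depending on the computed input splits\n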

For SparkContext.textFile the input format class is org.apache.hadoop.mapred.TextInputFormat (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html).

The javadoc of org.apache.hadoop.mapred.FileInputFormat (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html) says:

                                                                                                                                                    FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.

TIP: You may find the sources of org.apache.hadoop.mapred.FileInputFormat.getSplits (https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L319) enlightening.

                                                                                                                                                    "},{"location":"rdd/HadoopRDD/#whats-hadoop-split-input-splits-for-hadoop-reads-see-inputformatgetsplits","title":"What's Hadoop Split? input splits for Hadoop reads? See InputFormat.getSplits","text":""},{"location":"rdd/HashPartitioner/","title":"HashPartitioner","text":"

                                                                                                                                                    HashPartitioner is a Partitioner for hash-based partitioning.

                                                                                                                                                    Important

HashPartitioner places null keys in the 0th partition.

                                                                                                                                                    HashPartitioner is used as the default Partitioner.

                                                                                                                                                    "},{"location":"rdd/HashPartitioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                    HashPartitioner takes the following to be created:

                                                                                                                                                    • Number of partitions"},{"location":"rdd/HashPartitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                      numPartitions: Int\n

                                                                                                                                                      numPartitions returns the given number of partitions.

                                                                                                                                                      numPartitions\u00a0is part of the Partitioner abstraction.

                                                                                                                                                      ","text":""},{"location":"rdd/HashPartitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                      getPartition(\n  key: Any): Int\n

                                                                                                                                                      For null keys getPartition simply returns 0.

                                                                                                                                                      For non-null keys, getPartition uses the Object.hashCode of the key modulo the number of partitions. For negative results, getPartition adds the number of partitions to make it non-negative.

                                                                                                                                                      getPartition\u00a0is part of the Partitioner abstraction.
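A quick spark-shell check of the behaviour described above (the exact partition of a non-null key depends on its hashCode):

import org.apache.spark.HashPartitioner\nval p = new HashPartitioner(4)\np.getPartition(null)    // 0\np.getPartition(\"spark\") // hashCode modulo 4, in the range [0, 4)\np.getPartition(-3)      // negative results are shifted into the non-negative range\n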

                                                                                                                                                      ","text":""},{"location":"rdd/LocalCheckpointRDD/","title":"LocalCheckpointRDD","text":"

                                                                                                                                                      LocalCheckpointRDD[T] is a CheckpointRDD.

                                                                                                                                                      "},{"location":"rdd/LocalCheckpointRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                      LocalCheckpointRDD takes the following to be created:

                                                                                                                                                      • RDD
                                                                                                                                                      • SparkContext
                                                                                                                                                      • RDD ID
                                                                                                                                                      • Number of Partitions

                                                                                                                                                        LocalCheckpointRDD is created\u00a0when:

                                                                                                                                                        • LocalRDDCheckpointData is requested to doCheckpoint
                                                                                                                                                        "},{"location":"rdd/LocalCheckpointRDD/#partitions","title":"Partitions
                                                                                                                                                        getPartitions: Array[Partition]\n

                                                                                                                                                        getPartitions\u00a0is part of the RDD abstraction.

                                                                                                                                                        getPartitions creates a CheckpointRDDPartition for every input partition (index).

                                                                                                                                                        ","text":""},{"location":"rdd/LocalCheckpointRDD/#computing-partition","title":"Computing Partition
                                                                                                                                                        compute(\n  partition: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                        compute\u00a0is part of the RDD abstraction.

compute merely throws a SparkException (that explains the reason):

Checkpoint block [RDDBlockId] not found! Either the executor\nthat originally checkpointed this partition is no longer alive, or the original RDD is\nunpersisted. If this problem persists, you may consider using `rdd.checkpoint()`\ninstead, which is slower than local checkpointing but more fault-tolerant.\n
                                                                                                                                                        ","text":""},{"location":"rdd/LocalRDDCheckpointData/","title":"LocalRDDCheckpointData","text":"

LocalRDDCheckpointData is an RDDCheckpointData.

                                                                                                                                                        "},{"location":"rdd/LocalRDDCheckpointData/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                        LocalRDDCheckpointData takes the following to be created:

                                                                                                                                                        • RDD

                                                                                                                                                          LocalRDDCheckpointData is created\u00a0when:

                                                                                                                                                          • RDD is requested to localCheckpoint
                                                                                                                                                          "},{"location":"rdd/LocalRDDCheckpointData/#docheckpoint","title":"doCheckpoint
                                                                                                                                                          doCheckpoint(): CheckpointRDD[T]\n

                                                                                                                                                          doCheckpoint\u00a0is part of the RDDCheckpointData abstraction.

                                                                                                                                                          doCheckpoint creates a LocalCheckpointRDD with the RDD. doCheckpoint triggers caching any missing partitions (by checking availability of the RDDBlockIds for the partitions in the BlockManagerMaster).

                                                                                                                                                          Extra Spark Job

                                                                                                                                                          If there are any missing partitions (RDDBlockIds) doCheckpoint requests the SparkContext to run a Spark job with the RDD and the missing partitions.

doCheckpoint makes sure that the StorageLevel of the RDD uses disk (among other persistence storages). If not, doCheckpoint throws an AssertionError:

                                                                                                                                                          Storage level [level] is not appropriate for local checkpointing\n
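A minimal spark-shell sketch of local checkpointing (which is what creates a LocalRDDCheckpointData under the covers):

val r = sc.parallelize(1 to 100).map(_ * 2)\nr.localCheckpoint() // marks the RDD for local checkpointing (disk-backed storage level)\nr.count()           // the first action materializes and locally checkpoints the partitions\nr.toDebugString     // the lineage now ends at the checkpointed RDD\n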
                                                                                                                                                          ","text":""},{"location":"rdd/MapPartitionsRDD/","title":"MapPartitionsRDD","text":"

MapPartitionsRDD[U, T] is an RDD that transforms (maps) input T records into Us using the partition function.

MapPartitionsRDD is an RDD that has exactly one one-to-one narrow dependency on the parent RDD.
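Both facts can be observed in spark-shell:

val mapped = sc.parallelize(0 to 9).map(_ * 2)\nmapped.getClass.getSimpleName        // MapPartitionsRDD\nmapped.dependencies.foreach(println) // a single org.apache.spark.OneToOneDependency\n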

                                                                                                                                                          "},{"location":"rdd/MapPartitionsRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                          MapPartitionsRDD takes the following to be created:

                                                                                                                                                          • Parent RDD (RDD[T])
                                                                                                                                                          • Partition Function
                                                                                                                                                          • preservesPartitioning flag
                                                                                                                                                          • isFromBarrier Flag
                                                                                                                                                          • isOrderSensitive flag

                                                                                                                                                            MapPartitionsRDD is created when:

                                                                                                                                                            • PairRDDFunctions is requested to mapValues and flatMapValues
                                                                                                                                                            • RDD is requested to map, flatMap, filter, glom, mapPartitions, mapPartitionsWithIndexInternal, mapPartitionsInternal, mapPartitionsWithIndex
                                                                                                                                                            • RDDBarrier is requested to mapPartitions, mapPartitionsWithIndex
                                                                                                                                                            "},{"location":"rdd/MapPartitionsRDD/#barrier-rdd","title":"Barrier RDD","text":"

                                                                                                                                                            MapPartitionsRDD can be a barrier RDD in Barrier Execution Mode.

                                                                                                                                                            "},{"location":"rdd/MapPartitionsRDD/#isFromBarrier","title":"isFromBarrier Flag","text":"

                                                                                                                                                            MapPartitionsRDD can be given isFromBarrier flag when created.

                                                                                                                                                            isFromBarrier flag is assumed disabled (false) and can only be enabled (true) using RDDBarrier transformations:

                                                                                                                                                            • RDDBarrier.mapPartitions
                                                                                                                                                            • RDDBarrier.mapPartitionsWithIndex
                                                                                                                                                            "},{"location":"rdd/MapPartitionsRDD/#isBarrier_","title":"isBarrier_","text":"RDD
                                                                                                                                                            isBarrier_ : Boolean\n

                                                                                                                                                            isBarrier_ is part of the RDD abstraction.

                                                                                                                                                            isBarrier_ is enabled (true) when either this MapPartitionsRDD is isFromBarrier or any of the parent RDDs is isBarrier. Otherwise, isBarrier_ is disabled (false).
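A sketch of how the flag gets enabled (constructing the RDD is enough to inspect the lineage; actually executing a barrier stage requires enough task slots to run all its tasks concurrently):

val barrierRdd = sc.parallelize(1 to 10, 2)\n  .barrier()               // RDDBarrier\n  .mapPartitions(it => it) // MapPartitionsRDD created with isFromBarrier enabled\n// running an action on barrierRdd launches both tasks of the barrier stage together\n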

                                                                                                                                                            "},{"location":"rdd/NarrowDependency/","title":"NarrowDependency","text":"

                                                                                                                                                            NarrowDependency[T] is an extension of the Dependency abstraction for narrow dependencies (of RDD[T]s) where each partition of the child RDD depends on a small number of partitions of the parent RDD.

                                                                                                                                                            ","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#getparents","title":"getParents
                                                                                                                                                            getParents(\n  partitionId: Int): Seq[Int]\n

                                                                                                                                                            The parent partitions for a given child partition

                                                                                                                                                            Used when:

                                                                                                                                                            • DAGScheduler is requested for the preferred locations (of a partition of an RDD)
                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#implementations","title":"Implementations","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#onetoonedependency","title":"OneToOneDependency

                                                                                                                                                            OneToOneDependency is a NarrowDependency with getParents returning a single-element collection with the given partitionId.

                                                                                                                                                            val myRdd = sc.parallelize(0 to 9).map((_, 1))\n\nscala> :type myRdd\norg.apache.spark.rdd.RDD[(Int, Int)]\n\nscala> myRdd.dependencies.foreach(println)\norg.apache.spark.OneToOneDependency@801fe56\n\nimport org.apache.spark.OneToOneDependency\nval dep = myRdd.dependencies.head.asInstanceOf[OneToOneDependency[(_, _)]]\n\nscala> println(dep.getParents(0))\nList(0)\n\nscala> println(dep.getParents(1))\nList(1)\n
                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#prunedependency","title":"PruneDependency

PruneDependency is a NarrowDependency that represents a dependency between the PartitionPruningRDD and its parent RDD (with a subset of the partitions of the parent).

                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#rangedependency","title":"RangeDependency

                                                                                                                                                            RangeDependency is a NarrowDependency that represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.

                                                                                                                                                            Used in UnionRDD (SparkContext.union).

                                                                                                                                                            val r1 = sc.range(0, 4)\nval r2 = sc.range(5, 9)\n\nval unioned = sc.union(r1, r2)\n\nscala> unioned.dependencies.foreach(println)\norg.apache.spark.RangeDependency@76b0e1d9\norg.apache.spark.RangeDependency@3f3e51e0\n\nimport org.apache.spark.RangeDependency\nval dep = unioned.dependencies.head.asInstanceOf[RangeDependency[(_, _)]]\n\nscala> println(dep.getParents(0))\nList(0)\n
                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                            NarrowDependency takes the following to be created:

                                                                                                                                                            • RDD[T]

                                                                                                                                                              Abstract Class

                                                                                                                                                              NarrowDependency is an abstract class and cannot be created directly. It is created indirectly for the concrete NarrowDependencies.

                                                                                                                                                              ","tags":["DeveloperApi"]},{"location":"rdd/NewHadoopRDD/","title":"NewHadoopRDD","text":"


NewHadoopRDD is an RDD of K keys and V values.

NewHadoopRDD is created when:

                                                                                                                                                              • SparkContext.newAPIHadoopFile
                                                                                                                                                              • SparkContext.newAPIHadoopRDD
                                                                                                                                                              • (indirectly) SparkContext.binaryFiles
                                                                                                                                                              • (indirectly) SparkContext.wholeTextFiles

Note

NewHadoopRDD is the base RDD of BinaryFileRDD and WholeTextFileRDD.

getPreferredLocations Method

getPreferredLocations...FIXME

Creating Instance

NewHadoopRDD takes the following to be created:

• SparkContext
• InputFormat[K, V] (Hadoop)
• K class name
• V class name
• transient Configuration (Hadoop)

NewHadoopRDD initializes the internal registries and counters."},{"location":"rdd/OrderedRDDFunctions/","title":"OrderedRDDFunctions","text":"

                                                                                                                                                              class OrderedRDDFunctions[\n  K: Ordering : ClassTag,\n  V: ClassTag,\n  P <: Product2[K, V] : ClassTag]\n

                                                                                                                                                              OrderedRDDFunctions adds extra operators to RDDs of (key, value) pairs (RDD[(K, V)]) where the K key is sortable (i.e. any key type K that has an implicit Ordering[K] in scope).

                                                                                                                                                              Tip

                                                                                                                                                              Learn more about Ordering in the Scala Standard Library documentation.
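
For example (a sketch assuming a Spark shell with sc in scope), an RDD[(String, Int)] picks up the extra operators automatically since an implicit Ordering[String] is in scope:

val kv = sc.parallelize(Seq((\"b\", 2), (\"a\", 1), (\"c\", 3)))\n// sortByKey comes from OrderedRDDFunctions (via implicit conversion)\nkv.sortByKey().collect\n// Array((a,1), (b,2), (c,3))\n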

                                                                                                                                                              "},{"location":"rdd/OrderedRDDFunctions/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                              OrderedRDDFunctions takes the following to be created:

                                                                                                                                                              • RDD of Ps

                                                                                                                                                                OrderedRDDFunctions is created using RDD.rddToOrderedRDDFunctions implicit method.

                                                                                                                                                                "},{"location":"rdd/OrderedRDDFunctions/#filterbyrange","title":"filterByRange
                                                                                                                                                                filterByRange(\n  lower: K,\n  upper: K): RDD[P]\n

                                                                                                                                                                filterByRange...FIXME
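
A usage sketch (assuming a Spark shell with sc in scope): filterByRange keeps only the keys in the inclusive [lower, upper] range and is most efficient when the RDD is already range-partitioned (e.g. after sortByKey), as partitions outside the range can be skipped:

val kv = sc.parallelize(('a' to 'e').map(c => (c.toString, c.toInt)))\nval sorted = kv.sortByKey()  // RangePartitioner enables partition pruning\nsorted.filterByRange(\"b\", \"d\").collect\n// Array((b,98), (c,99), (d,100))\n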

                                                                                                                                                                ","text":""},{"location":"rdd/OrderedRDDFunctions/#repartitionandsortwithinpartitions","title":"repartitionAndSortWithinPartitions
                                                                                                                                                                repartitionAndSortWithinPartitions(\n  partitioner: Partitioner): RDD[(K, V)]\n

                                                                                                                                                                repartitionAndSortWithinPartitions creates a ShuffledRDD with the given Partitioner.

                                                                                                                                                                Note

repartitionAndSortWithinPartitions is a generalization of the sortByKey operator.
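
A minimal sketch (assuming a Spark shell with sc in scope) that shuffles records into two partitions and sorts keys within each partition:

import org.apache.spark.HashPartitioner\n\nval kv = sc.parallelize(Seq((3, \"c\"), (1, \"a\"), (2, \"b\"), (4, \"d\")))\nval repartitioned = kv.repartitionAndSortWithinPartitions(new HashPartitioner(2))\n// keys are sorted within every partition (not globally)\nrepartitioned.glom().collect\n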

                                                                                                                                                                ","text":""},{"location":"rdd/OrderedRDDFunctions/#sortbykey","title":"sortByKey
                                                                                                                                                                sortByKey(\n  ascending: Boolean = true,\n  numPartitions: Int = self.partitions.length): RDD[(K, V)]\n

                                                                                                                                                                sortByKey creates a ShuffledRDD (with the RDD and a RangePartitioner).

                                                                                                                                                                Note

sortByKey is a specialization of the repartitionAndSortWithinPartitions operator.

                                                                                                                                                                sortByKey is used when:

                                                                                                                                                                • RDD.sortBy high-level operator is used
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/","title":"PairRDDFunctions","text":"

                                                                                                                                                                PairRDDFunctions is an extension of RDD API for additional high-level operators to work with key-value RDDs (RDD[(K, V)]).

PairRDDFunctions is available on RDDs of key-value pairs via a Scala implicit conversion.

                                                                                                                                                                The gist of PairRDDFunctions is combineByKeyWithClassTag.
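
For example (a sketch assuming a Spark shell with sc in scope), no extra import is needed since the implicit conversion lives in the RDD companion object:

val sales = sc.parallelize(Seq((\"apple\", 3), (\"banana\", 2), (\"apple\", 5)))\n// reduceByKey is a PairRDDFunctions operator made available by the implicit conversion\nsales.reduceByKey(_ + _).collect\n// e.g. Array((apple,8), (banana,2))\n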

                                                                                                                                                                "},{"location":"rdd/PairRDDFunctions/#aggregatebykey","title":"aggregateByKey
                                                                                                                                                                aggregateByKey[U: ClassTag](\n  zeroValue: U)(\n  seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)] // (1)!\naggregateByKey[U: ClassTag](\n  zeroValue: U,\n  numPartitions: Int)(seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)] // (2)!\naggregateByKey[U: ClassTag](\n  zeroValue: U,\n  partitioner: Partitioner)(\n  seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                aggregateByKey...FIXME
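
A common use (a sketch assuming a Spark shell with sc in scope) is a per-key average with a (sum, count) pair as the zero value:

val scores = sc.parallelize(Seq((\"a\", 1.0), (\"a\", 3.0), (\"b\", 2.0)))\nval sumCount = scores.aggregateByKey((0.0, 0))(\n  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: merge a value into the per-partition accumulator\n  (l, r) => (l._1 + r._1, l._2 + r._2))  // combOp: merge accumulators across partitions\nsumCount.mapValues { case (sum, count) => sum / count }.collect\n// e.g. Array((a,2.0), (b,2.0))\n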

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#combinebykey","title":"combineByKey
                                                                                                                                                                combineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C): RDD[(K, C)]\ncombineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  numPartitions: Int): RDD[(K, C)]\ncombineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  partitioner: Partitioner,\n  mapSideCombine: Boolean = true,\n  serializer: Serializer = null): RDD[(K, C)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                combineByKey...FIXME

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#combinebykeywithclasstag","title":"combineByKeyWithClassTag
                                                                                                                                                                combineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] // (1)!\ncombineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] // (2)!\ncombineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  partitioner: Partitioner,\n  mapSideCombine: Boolean = true,\n  serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Uses a HashPartitioner (with the given numPartitions)

                                                                                                                                                                combineByKeyWithClassTag creates an Aggregator for the given aggregation functions.

                                                                                                                                                                combineByKeyWithClassTag branches off per the given Partitioner.

If the given Partitioner is the same as the RDD's, combineByKeyWithClassTag simply uses mapPartitions on the RDD with the following arguments:

                                                                                                                                                                • Iterator of the Aggregator

                                                                                                                                                                • preservesPartitioning flag turned on

If the input partitioner is different from the RDD's, combineByKeyWithClassTag creates a ShuffledRDD (with the Serializer, the Aggregator, and the mapSideCombine flag).

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#usage","title":"Usage

                                                                                                                                                                combineByKeyWithClassTag lays the foundation for the following high-level RDD key-value pair transformations:

                                                                                                                                                                • aggregateByKey
                                                                                                                                                                • combineByKey
                                                                                                                                                                • countApproxDistinctByKey
                                                                                                                                                                • foldByKey
                                                                                                                                                                • groupByKey
                                                                                                                                                                • reduceByKey
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#requirements","title":"Requirements

combineByKeyWithClassTag requires that mergeCombiners is defined (non-null) or throws an IllegalArgumentException:

                                                                                                                                                                mergeCombiners must be defined\n

combineByKeyWithClassTag throws a SparkException when the keys are of an array type and the mapSideCombine flag is enabled:

                                                                                                                                                                Cannot use map-side combining with array keys.\n

combineByKeyWithClassTag throws a SparkException when the keys are of an array type and the partitioner is a HashPartitioner:

                                                                                                                                                                HashPartitioner cannot partition array keys.\n
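
A quick way to see the first error (a sketch assuming a Spark shell with sc in scope) is to use array keys with reduceByKey, which enables map-side combining by default:

val badPairs = sc.parallelize(Seq((Array(1, 2), \"a\"), (Array(1, 2), \"b\")))\n// throws SparkException: Cannot use map-side combining with array keys.\nbadPairs.reduceByKey(_ + _)\n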
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#example","title":"Example
                                                                                                                                                                val nums = sc.parallelize(0 to 9, numSlices = 4)\nval groups = nums.keyBy(_ % 2)\ndef createCombiner(n: Int) = {\n  println(s\"createCombiner($n)\")\n  n\n}\ndef mergeValue(n1: Int, n2: Int) = {\n  println(s\"mergeValue($n1, $n2)\")\n  n1 + n2\n}\ndef mergeCombiners(c1: Int, c2: Int) = {\n  println(s\"mergeCombiners($c1, $c2)\")\n  c1 + c2\n}\nval countByGroup = groups.combineByKeyWithClassTag(\n  createCombiner,\n  mergeValue,\n  mergeCombiners)\nprintln(countByGroup.toDebugString)\n/*\n(4) ShuffledRDD[3] at combineByKeyWithClassTag at <console>:31 []\n +-(4) MapPartitionsRDD[1] at keyBy at <console>:25 []\n    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n*/\n
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#countapproxdistinctbykey","title":"countApproxDistinctByKey
                                                                                                                                                                countApproxDistinctByKey(\n  relativeSD: Double = 0.05): RDD[(K, Long)] // (1)!\ncountApproxDistinctByKey(\n  relativeSD: Double,\n  numPartitions: Int): RDD[(K, Long)] // (2)!\ncountApproxDistinctByKey(\n  relativeSD: Double,\n  partitioner: Partitioner): RDD[(K, Long)]\ncountApproxDistinctByKey(\n  p: Int,\n  sp: Int,\n  partitioner: Partitioner): RDD[(K, Long)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                countApproxDistinctByKey...FIXME
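
A usage sketch (assuming a Spark shell with sc in scope): countApproxDistinctByKey estimates the number of distinct values per key, with relativeSD controlling the relative accuracy of the estimate:

val visits = sc.parallelize(Seq((\"page1\", \"user1\"), (\"page1\", \"user2\"), (\"page1\", \"user1\"), (\"page2\", \"user3\")))\n// approximate number of distinct users per page\nvisits.countApproxDistinctByKey(relativeSD = 0.05).collect\n// e.g. Array((page1,2), (page2,1))\n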

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#foldbykey","title":"foldByKey
                                                                                                                                                                foldByKey(\n  zeroValue: V)(\n  func: (V, V) => V): RDD[(K, V)] // (1)!\nfoldByKey(\n  zeroValue: V,\n  numPartitions: Int)(\n  func: (V, V) => V): RDD[(K, V)] // (2)!\nfoldByKey(\n  zeroValue: V,\n  partitioner: Partitioner)(\n  func: (V, V) => V): RDD[(K, V)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                foldByKey...FIXME

                                                                                                                                                                foldByKey is used when:

                                                                                                                                                                • RDD.treeAggregate high-level operator is used
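
A minimal example (assuming a Spark shell with sc in scope) that sums values per key with 0 as the zero value:

val kv = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\nkv.foldByKey(0)(_ + _).collect\n// e.g. Array((a,4), (b,2))\n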
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#groupbykey","title":"groupByKey
                                                                                                                                                                groupByKey(): RDD[(K, Iterable[V])] // (1)!\ngroupByKey(\n  numPartitions: Int): RDD[(K, Iterable[V])] // (2)!\ngroupByKey(\n  partitioner: Partitioner): RDD[(K, Iterable[V])]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                groupByKey...FIXME

                                                                                                                                                                groupByKey is used when:

                                                                                                                                                                • RDD.groupBy high-level operator is used
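
A usage sketch (assuming a Spark shell with sc in scope). Note that groupByKey shuffles all the values of a key to a single task; prefer reduceByKey or aggregateByKey when values can be combined map-side:

val kv = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\nkv.groupByKey().mapValues(_.toList).collect\n// e.g. Array((a,List(1, 3)), (b,List(2)))\n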
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#partitionby","title":"partitionBy
                                                                                                                                                                partitionBy(\n  partitioner: Partitioner): RDD[(K, V)]\n

                                                                                                                                                                partitionBy...FIXME
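
A usage sketch (assuming a Spark shell with sc in scope): partitionBy returns a copy of this RDD partitioned by the given Partitioner:

import org.apache.spark.HashPartitioner\n\nval kv = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\nval partitioned = kv.partitionBy(new HashPartitioner(4))\npartitioned.partitioner.isDefined  // true\npartitioned.getNumPartitions      // 4\n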

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#reducebykey","title":"reduceByKey
                                                                                                                                                                reduceByKey(\n  func: (V, V) => V): RDD[(K, V)] // (1)!\nreduceByKey(\n  func: (V, V) => V,\n  numPartitions: Int): RDD[(K, V)] // (2)!\nreduceByKey(\n  partitioner: Partitioner,\n  func: (V, V) => V): RDD[(K, V)]\n
                                                                                                                                                                1. Uses the default Partitioner
                                                                                                                                                                2. Creates a HashPartitioner with the given numPartitions partitions

reduceByKey is a special case of aggregateByKey that uses the same function to merge values within and across partitions (as shown in the example below).

                                                                                                                                                                reduceByKey is used when:

                                                                                                                                                                • RDD.distinct high-level operator is used
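
For example (a sketch assuming a Spark shell with sc in scope), the two transformations below sum values per key and produce the same result:

val kv = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\nkv.reduceByKey(_ + _).collect\nkv.aggregateByKey(0)(_ + _, _ + _).collect\n// both give e.g. Array((a,4), (b,2))\n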
                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#saveasnewapihadoopfile","title":"saveAsNewAPIHadoopFile
                                                                                                                                                                saveAsNewAPIHadoopFile(\n  path: String,\n  keyClass: Class[_],\n  valueClass: Class[_],\n  outputFormatClass: Class[_ <: NewOutputFormat[_, _]],\n  conf: Configuration = self.context.hadoopConfiguration): Unit\nsaveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](\n  path: String)(implicit fm: ClassTag[F]): Unit\n

                                                                                                                                                                saveAsNewAPIHadoopFile creates a new Job (Hadoop MapReduce) for the given Configuration (Hadoop).

                                                                                                                                                                saveAsNewAPIHadoopFile configures the Job (with the given keyClass, valueClass and outputFormatClass).

saveAsNewAPIHadoopFile sets the mapreduce.output.fileoutputformat.outputdir configuration property to the given path and saves the RDD using saveAsNewAPIHadoopDataset.
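
A hedged example (assuming a Spark shell with sc in scope; the output path is illustrative) that saves a key-value RDD using the new Hadoop API TextOutputFormat:

import org.apache.hadoop.io.{IntWritable, Text}\nimport org.apache.hadoop.mapreduce.lib.output.TextOutputFormat\n\nval kv = sc.parallelize(Seq((\"a\", 1), (\"b\", 2)))\nkv\n  .map { case (k, v) => (new Text(k), new IntWritable(v)) }\n  .saveAsNewAPIHadoopFile[TextOutputFormat[Text, IntWritable]](\"/tmp/save-as-new-api-demo\")\n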

                                                                                                                                                                ","text":""},{"location":"rdd/PairRDDFunctions/#saveasnewapihadoopdataset","title":"saveAsNewAPIHadoopDataset
                                                                                                                                                                saveAsNewAPIHadoopDataset(\n  conf: Configuration): Unit\n

                                                                                                                                                                saveAsNewAPIHadoopDataset creates a new HadoopMapReduceWriteConfigUtil (with the given Configuration) and writes the RDD out.

The Configuration should have all the relevant output parameters set (the output format, output paths, e.g. a table name to write to) in the same way it would be configured for a Hadoop MapReduce job.

                                                                                                                                                                ","text":""},{"location":"rdd/ParallelCollectionRDD/","title":"ParallelCollectionRDD","text":"

                                                                                                                                                                ParallelCollectionRDD is an RDD of a collection of elements with numSlices partitions and optional locationPrefs.

                                                                                                                                                                ParallelCollectionRDD is the result of SparkContext.parallelize and SparkContext.makeRDD methods.

The data collection is split into numSlices slices.

                                                                                                                                                                It uses ParallelCollectionPartition.

                                                                                                                                                                "},{"location":"rdd/Partition/","title":"Partition","text":"

Partition describes a partition (a chunk of data) of an RDD.

NOTE: A partition is missing when it has not been computed yet.

Partition is identified by a partition index that is a unique identifier of a partition of an RDD.

                                                                                                                                                                "},{"location":"rdd/Partition/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/Partition/#index-int","title":"index: Int","text":""},{"location":"rdd/Partitioner/","title":"Partitioner","text":"

                                                                                                                                                                Partitioner is an abstraction of partitioners that define how the elements in a key-value pair RDD are partitioned by key.

                                                                                                                                                                Partitioner maps keys to partition IDs (from 0 to numPartitions exclusive).

                                                                                                                                                                Partitioner ensures that records with the same key are in the same partition.

                                                                                                                                                                Partitioner is a Java Serializable.

                                                                                                                                                                "},{"location":"rdd/Partitioner/#contract","title":"Contract","text":""},{"location":"rdd/Partitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                                getPartition(\n  key: Any): Int\n

                                                                                                                                                                Partition ID for the given key

                                                                                                                                                                ","text":""},{"location":"rdd/Partitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                                numPartitions: Int\n
                                                                                                                                                                ","text":""},{"location":"rdd/Partitioner/#implementations","title":"Implementations","text":"
                                                                                                                                                                • HashPartitioner
                                                                                                                                                                • RangePartitioner
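
Custom partitioning only requires implementing the two contract methods above. A minimal sketch (the class name and routing logic are illustrative; real implementations should also override equals and hashCode so Spark can recognize co-partitioned RDDs):

import org.apache.spark.Partitioner\n\n// routes negative keys to partition 0 and non-negative keys to partition 1\nclass SignPartitioner extends Partitioner {\n  override def numPartitions: Int = 2\n  override def getPartition(key: Any): Int =\n    if (key.asInstanceOf[Int] < 0) 0 else 1\n}\n\nval kv = sc.parallelize(Seq((-1, \"neg\"), (2, \"pos\"), (-5, \"neg\")))\nkv.partitionBy(new SignPartitioner).glom().collect\n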
                                                                                                                                                                "},{"location":"rdd/RDD/","title":"RDD \u2014 Description of Distributed Computation","text":"

                                                                                                                                                                RDD[T] is an abstraction of fault-tolerant resilient distributed datasets that are mere descriptions of computations over a distributed collection of records (of type T).

                                                                                                                                                                "},{"location":"rdd/RDD/#contract","title":"Contract","text":""},{"location":"rdd/RDD/#compute","title":"Computing Partition","text":"
                                                                                                                                                                compute(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                Computes the input Partition (with the TaskContext) to produce values (of type T)

                                                                                                                                                                See:

                                                                                                                                                                • LocalCheckpointRDD
                                                                                                                                                                • MapPartitionsRDD
                                                                                                                                                                • ReliableCheckpointRDD
                                                                                                                                                                • ShuffledRDD

                                                                                                                                                                Used when:

                                                                                                                                                                • RDD is requested to computeOrReadCheckpoint
                                                                                                                                                                "},{"location":"rdd/RDD/#getPartitions","title":"Partitions","text":"
                                                                                                                                                                getPartitions: Array[Partition]\n

                                                                                                                                                                Partitions of this RDD

                                                                                                                                                                See:

                                                                                                                                                                • LocalCheckpointRDD
                                                                                                                                                                • MapPartitionsRDD
                                                                                                                                                                • ReliableCheckpointRDD
                                                                                                                                                                • ShuffledRDD

                                                                                                                                                                Used when:

                                                                                                                                                                • RDD is requested for the partitions
                                                                                                                                                                "},{"location":"rdd/RDD/#implementations","title":"Implementations","text":"
                                                                                                                                                                • CheckpointRDD
                                                                                                                                                                • CoalescedRDD
                                                                                                                                                                • CoGroupedRDD
                                                                                                                                                                • HadoopRDD
                                                                                                                                                                • MapPartitionsRDD
                                                                                                                                                                • NewHadoopRDD
                                                                                                                                                                • ParallelCollectionRDD
                                                                                                                                                                • ReliableCheckpointRDD
                                                                                                                                                                • ShuffledRDD
                                                                                                                                                                • others
                                                                                                                                                                "},{"location":"rdd/RDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                RDD takes the following to be created:

                                                                                                                                                                • SparkContext
• Dependencies (Parent RDDs that should be computed successfully before this RDD)

  Abstract Class

                                                                                                                                                                  RDD\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete RDDs.

                                                                                                                                                                  "},{"location":"rdd/RDD/#barrier-rdd","title":"Barrier RDD","text":"

Barrier RDD is an RDD with the isBarrier flag enabled.

                                                                                                                                                                  ShuffledRDD can never be a barrier RDD as it overrides isBarrier method to be always disabled (false).

                                                                                                                                                                  "},{"location":"rdd/RDD/#isBarrier","title":"isBarrier","text":"
                                                                                                                                                                  isBarrier(): Boolean\n

                                                                                                                                                                  isBarrier is the value of isBarrier_.

                                                                                                                                                                  isBarrier is used when:

                                                                                                                                                                  • DAGScheduler is requested to submitMissingTasks (that are either ShuffleMapStages to create ShuffleMapTasks or ResultStage to create ResultTasks)
                                                                                                                                                                  • RDDInfo is created
                                                                                                                                                                  • ShuffleDependency is requested to canShuffleMergeBeEnabled
                                                                                                                                                                  • DAGScheduler is requested to checkBarrierStageWithRDDChainPattern, checkBarrierStageWithDynamicAllocation, checkBarrierStageWithNumSlots, handleTaskCompletion (FetchFailed case to mark a map stage as broken)
                                                                                                                                                                  "},{"location":"rdd/RDD/#isBarrier_","title":"isBarrier_","text":"
                                                                                                                                                                  isBarrier_ : Boolean // (1)!\n
                                                                                                                                                                  1. @transient protected lazy val

isBarrier_ is enabled (true) when there is at least one barrier RDD among the parent RDDs (excluding parents behind a ShuffleDependency).

                                                                                                                                                                  Note

isBarrier_ is overridden by PythonRDD and MapPartitionsRDD that both accept an isFromBarrier flag.
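
For reference (a sketch assuming a Spark shell with sc in scope), such a barrier MapPartitionsRDD comes from the barrier execution mode API:

val rdd = sc.parallelize(0 to 9, numSlices = 2)\n// RDD.barrier gives an RDDBarrier; mapPartitions on it creates a MapPartitionsRDD\n// with the isFromBarrier flag on (and so isBarrier_ is true)\nval barrierRdd = rdd.barrier().mapPartitions { iter => iter }\n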

                                                                                                                                                                  "},{"location":"rdd/RDD/#resourceProfile","title":"ResourceProfile (Stage-Level Scheduling)","text":"

                                                                                                                                                                  RDD can be assigned a ResourceProfile using RDD.withResources method.

                                                                                                                                                                  val rdd: RDD[_] = ...\nrdd\n  .withResources(...) // request resources for a computation\n  .mapPartitions(...) // the computation\n

                                                                                                                                                                  RDD uses resourceProfile internal registry for the ResourceProfile that is undefined initially.

                                                                                                                                                                  The ResourceProfile is available using RDD.getResourceProfile method.

                                                                                                                                                                  "},{"location":"rdd/RDD/#withResources","title":"withResources","text":"
                                                                                                                                                                  withResources(\n  rp: ResourceProfile): this.type\n

                                                                                                                                                                  withResources sets the given ResourceProfile as the resourceProfile and requests the ResourceProfileManager to add the resource profile.

                                                                                                                                                                  "},{"location":"rdd/RDD/#getResourceProfile","title":"getResourceProfile","text":"
                                                                                                                                                                  getResourceProfile(): ResourceProfile\n

                                                                                                                                                                  getResourceProfile returns the resourceProfile (if defined) or null.

                                                                                                                                                                  getResourceProfile is used when:

                                                                                                                                                                  • DAGScheduler is requested for the ShuffleDependencies and ResourceProfiles of an RDD
                                                                                                                                                                  "},{"location":"rdd/RDD/#preferredLocations","title":"Preferred Locations (Placement Preferences of Partition)","text":"
                                                                                                                                                                  preferredLocations(\n  split: Partition): Seq[String]\n
                                                                                                                                                                  Final Method

                                                                                                                                                                  preferredLocations is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                  Learn more in the Scala Language Specification.

preferredLocations requests the CheckpointRDD for the preferred locations of the given Partition if this RDD is checkpointed, or getPreferredLocations otherwise.

                                                                                                                                                                  preferredLocations is a template method that uses getPreferredLocations that custom RDDs can override to specify placement preferences on their own.

                                                                                                                                                                  preferredLocations\u00a0is used when:

                                                                                                                                                                  • DAGScheduler is requested for preferred locations
                                                                                                                                                                  "},{"location":"rdd/RDD/#partitions","title":"Partitions","text":"
                                                                                                                                                                  partitions: Array[Partition]\n
                                                                                                                                                                  Final Method

                                                                                                                                                                  partitions is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                  Learn more in the Scala Language Specification.

                                                                                                                                                                  partitions requests the CheckpointRDD for the partitions if this RDD is checkpointed.

Otherwise, when this RDD is not checkpointed, partitions uses getPartitions (and caches the result in partitions_).

                                                                                                                                                                  Note

                                                                                                                                                                  getPartitions is an abstract method that custom RDDs are required to provide.

partitions has the property that the index of every partition is equal to its position in the returned array.

                                                                                                                                                                  partitions\u00a0is used when:

                                                                                                                                                                  • DAGScheduler is requested to getPreferredLocsInternal
                                                                                                                                                                  • SparkContext is requested to run a job
                                                                                                                                                                  • others
                                                                                                                                                                  "},{"location":"rdd/RDD/#dependencies","title":"dependencies","text":"
                                                                                                                                                                  dependencies: Seq[Dependency[_]]\n
                                                                                                                                                                  Final Method

                                                                                                                                                                  dependencies is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                  Learn more in the Scala Language Specification.

                                                                                                                                                                  dependencies branches off based on checkpointRDD (and availability of CheckpointRDD).

                                                                                                                                                                  With CheckpointRDD available (this RDD is checkpointed), dependencies returns a OneToOneDependency with the CheckpointRDD.

Otherwise, when this RDD is not checkpointed, dependencies uses getDependencies (and caches the result in dependencies_).

                                                                                                                                                                  Note

                                                                                                                                                                  getDependencies is an abstract method that custom RDDs are required to provide.

                                                                                                                                                                  "},{"location":"rdd/RDD/#checkpoint","title":"Reliable Checkpointing","text":"
                                                                                                                                                                  checkpoint(): Unit\n

                                                                                                                                                                  Public API

                                                                                                                                                                  checkpoint is part of the public API.

                                                                                                                                                                  Procedure

                                                                                                                                                                  checkpoint is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                                                                                                                                                  checkpoint creates a new ReliableRDDCheckpointData (with this RDD) and saves it in checkpointData registry.

                                                                                                                                                                  checkpoint does nothing when the checkpointData registry has already been defined.

                                                                                                                                                                  checkpoint throws a SparkException when the checkpoint directory is not specified:

                                                                                                                                                                  Checkpoint directory has not been set in the SparkContext\n
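The following spark-shell snippet is a minimal sketch of the contract above (the checkpoint directory path is just an illustration): checkpoint fails until a checkpoint directory is set, and nothing is written until the first job runs.

```scala
// Hypothetical path; any Hadoop DFS-compatible directory works.
// Calling rdd.checkpoint() before setCheckpointDir throws:
//   org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // only records the intent (a new ReliableRDDCheckpointData)
rdd.count()        // the first job materializes the checkpoint (doCheckpoint)
```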
                                                                                                                                                                  "},{"location":"rdd/RDD/#checkpointData","title":"RDDCheckpointData","text":"
                                                                                                                                                                  checkpointData: Option[RDDCheckpointData[T]]\n

                                                                                                                                                                  RDD defines checkpointData internal registry for a RDDCheckpointData[T] (of T type of this RDD).

                                                                                                                                                                  The checkpointData registry is undefined (None) initially when this RDD is created and can hold a value after the following RDD API operators:

| RDD Operator | RDDCheckpointData |
|---|---|
| RDD.checkpoint | ReliableRDDCheckpointData |
| RDD.localCheckpoint | LocalRDDCheckpointData |

                                                                                                                                                                  checkpointData is used when:

                                                                                                                                                                  • isCheckpointedAndMaterialized
                                                                                                                                                                  • isLocallyCheckpointed
                                                                                                                                                                  • isReliablyCheckpointed
                                                                                                                                                                  • getCheckpointFile
                                                                                                                                                                  • doCheckpoint
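As a quick illustration of the registry (a spark-shell sketch; the checkpoint directory is an arbitrary example), the two operators install different RDDCheckpointData implementations, which is also visible through getCheckpointFile:

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")

val reliable = sc.parallelize(1 to 10)
reliable.checkpoint()                                  // checkpointData = Some(ReliableRDDCheckpointData)

val local = sc.parallelize(1 to 10).localCheckpoint()  // checkpointData = Some(LocalRDDCheckpointData)

reliable.count(); local.count()      // jobs materialize the checkpoints (doCheckpoint)

println(reliable.getCheckpointFile)  // Some(.../rdd-<id>) -- reliable checkpoints only
println(local.getCheckpointFile)     // None -- local checkpoints have no checkpoint file
```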
                                                                                                                                                                  "},{"location":"rdd/RDD/#checkpointRDD","title":"CheckpointRDD
                                                                                                                                                                  checkpointRDD: Option[CheckpointRDD[T]]\n

checkpointRDD returns the CheckpointRDD of the RDDCheckpointData (if defined, i.e. when this RDD is checkpointed).

                                                                                                                                                                  checkpointRDD is used when:

                                                                                                                                                                  • RDD is requested for the dependencies, partitions and preferred locations (all using final methods!)
                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#doCheckpoint","title":"doCheckpoint","text":"
                                                                                                                                                                  doCheckpoint(): Unit\n

                                                                                                                                                                  RDD.doCheckpoint, SparkContext.runJob and Dataset.checkpoint

                                                                                                                                                                  doCheckpoint is called every time a Spark job is submitted (using SparkContext.runJob).

                                                                                                                                                                  I found it quite interesting at the very least.

doCheckpoint is also triggered when the Dataset.checkpoint operator (Spark SQL) is executed with the eager flag on, which triggers one or more Spark jobs on the underlying RDD anyway.

                                                                                                                                                                  Procedure

                                                                                                                                                                  doCheckpoint is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                                                                                                                                                  Does nothing unless checkpointData is defined

                                                                                                                                                                  My understanding is that doCheckpoint does nothing (noop) unless the RDDCheckpointData is defined.

                                                                                                                                                                  doCheckpoint executes all the following in checkpoint scope.

                                                                                                                                                                  doCheckpoint turns the doCheckpointCalled flag on (to prevent multiple executions).

                                                                                                                                                                  doCheckpoint branches off based on whether a RDDCheckpointData is defined or not:

1. With the RDDCheckpointData defined, doCheckpoint checks the checkpointAllMarkedAncestors flag and, if enabled, first requests the Dependencies (of this RDD) for their RDDs that are in turn requested to doCheckpoint themselves. doCheckpoint then requests the RDDCheckpointData to checkpoint.

                                                                                                                                                                  2. With the RDDCheckpointData undefined, doCheckpoint requests the Dependencies (of this RDD) for their RDDs that are in turn requested to doCheckpoint themselves (recursively).

                                                                                                                                                                  Note

                                                                                                                                                                  With the RDDCheckpointData defined, requesting doCheckpoint of the Dependencies is guarded by checkpointAllMarkedAncestors flag.

                                                                                                                                                                  doCheckpoint skips execution if called earlier.
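Putting the above together, here is a simplified, excerpt-style sketch of the branching (modeled on RDD.doCheckpoint in the Spark sources; synchronization and the RDDOperationScope bookkeeping are omitted, so this is not standalone code):

```scala
// Simplified sketch of RDD.doCheckpoint (not standalone code).
if (!doCheckpointCalled) {
  doCheckpointCalled = true
  if (checkpointData.isDefined) {
    if (checkpointAllMarkedAncestors) {
      // Checkpoint the parents first so that their lineages are truncated too.
      dependencies.foreach(_.rdd.doCheckpoint())
    }
    checkpointData.get.checkpoint()
  } else {
    // Not marked for checkpointing itself: recurse into the parents.
    dependencies.foreach(_.rdd.doCheckpoint())
  }
}
```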

                                                                                                                                                                  CheckpointRDD

A CheckpointRDD is not checkpointed again (and does nothing when requested to do so).

                                                                                                                                                                  doCheckpoint is used when:

                                                                                                                                                                  • SparkContext is requested to run a job synchronously
                                                                                                                                                                  "},{"location":"rdd/RDD/#iterator","title":"iterator
                                                                                                                                                                  iterator(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                  iterator...FIXME

                                                                                                                                                                  Final Method

                                                                                                                                                                  iterator is a final method and may not be overridden in subclasses. See 5.2.6 final in the Scala Language Specification.

                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#getorcompute","title":"getOrCompute
                                                                                                                                                                  getOrCompute(\n  partition: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                  getOrCompute...FIXME

                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#computeorreadcheckpoint","title":"computeOrReadCheckpoint
                                                                                                                                                                  computeOrReadCheckpoint(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                  computeOrReadCheckpoint...FIXME

                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#debugging-recursive-dependencies","title":"Debugging Recursive Dependencies
                                                                                                                                                                  toDebugString: String\n

                                                                                                                                                                  toDebugString returns a RDD Lineage Graph.

                                                                                                                                                                  val wordCount = sc.textFile(\"README.md\")\n  .flatMap(_.split(\"\\\\s+\"))\n  .map((_, 1))\n  .reduceByKey(_ + _)\n\nscala> println(wordCount.toDebugString)\n(2) ShuffledRDD[21] at reduceByKey at <console>:24 []\n +-(2) MapPartitionsRDD[20] at map at <console>:24 []\n    |  MapPartitionsRDD[19] at flatMap at <console>:24 []\n    |  README.md MapPartitionsRDD[18] at textFile at <console>:24 []\n    |  README.md HadoopRDD[17] at textFile at <console>:24 []\n

                                                                                                                                                                  toDebugString uses indentations to indicate a shuffle boundary.

                                                                                                                                                                  The numbers in round brackets show the level of parallelism at each stage, e.g. (2) in the above output.

                                                                                                                                                                  scala> println(wordCount.getNumPartitions)\n2\n

                                                                                                                                                                  With spark.logLineage enabled, toDebugString is printed out when executing an action.

                                                                                                                                                                  $ ./bin/spark-shell --conf spark.logLineage=true\n\nscala> sc.textFile(\"README.md\", 4).count\n...\n15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25\n15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:\n(4) MapPartitionsRDD[1] at textFile at <console>:25 []\n |  README.md HadoopRDD[0] at textFile at <console>:25 []\n
                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#coalesce","title":"coalesce
                                                                                                                                                                  coalesce(\n  numPartitions: Int,\n  shuffle: Boolean = false,\n  partitionCoalescer: Option[PartitionCoalescer] = Option.empty)\n  (implicit ord: Ordering[T] = null): RDD[T]\n

                                                                                                                                                                  coalesce...FIXME

                                                                                                                                                                  coalesce is used when:

                                                                                                                                                                  • RDD.repartition high-level operator is used
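As a quick illustration (a spark-shell sketch), shrinking the number of partitions can avoid a shuffle, while RDD.repartition is coalesce with shuffle enabled:

```scala
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

val narrowed = rdd.coalesce(2)      // shuffle = false by default (no shuffle)
val widened  = rdd.repartition(16)  // repartition = coalesce(16, shuffle = true)

println(narrowed.getNumPartitions)  // 2
println(widened.getNumPartitions)   // 16
```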
                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#implicit-methods","title":"Implicit Methods","text":""},{"location":"rdd/RDD/#rddtoorderedrddfunctions","title":"rddToOrderedRDDFunctions
                                                                                                                                                                  rddToOrderedRDDFunctions[K : Ordering : ClassTag, V: ClassTag](\n  rdd: RDD[(K, V)]): OrderedRDDFunctions[K, V, (K, V)]\n

rddToOrderedRDDFunctions is a Scala implicit method that creates an OrderedRDDFunctions.

                                                                                                                                                                  rddToOrderedRDDFunctions is used (implicitly) when:

                                                                                                                                                                  • RDD.sortBy
                                                                                                                                                                  • PairRDDFunctions.combineByKey
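A small spark-shell example of the implicit conversion at work: sortByKey is defined on OrderedRDDFunctions, yet it is available on any RDD[(K, V)] with an Ordering[K] in scope thanks to rddToOrderedRDDFunctions.

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))

// The implicit rddToOrderedRDDFunctions conversion supplies sortByKey
// (Ordering[String] is in scope, so the keys are "sortable").
val sorted = pairs.sortByKey()
sorted.collect().foreach(println)   // (a,1) (b,2) (c,3)
```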
                                                                                                                                                                  ","text":""},{"location":"rdd/RDD/#withScope","title":"withScope
                                                                                                                                                                  withScope[U](\n  body: => U): U\n

withScope requests RDDOperationScope to withScope with this RDD's SparkContext (executing the given body in that scope).

                                                                                                                                                                  Note

                                                                                                                                                                  withScope is used for most (if not all) RDD API operators.
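For reference, this is a simplified, excerpt-style sketch of the pattern (modeled on RDD.map in the Spark sources; withScope is private[spark], so this is not user-level code): an operator wraps its body in withScope so the operation gets its own RDDOperationScope (visible, for example, in the DAG visualization of the web UI).

```scala
// Simplified sketch of an RDD operator wrapped in withScope (not standalone code).
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}
```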

                                                                                                                                                                  ","text":""},{"location":"rdd/RDDCheckpointData/","title":"RDDCheckpointData","text":"

                                                                                                                                                                  RDDCheckpointData is an abstraction of information related to RDD checkpointing.

                                                                                                                                                                  == [[implementations]] Available RDDCheckpointDatas

                                                                                                                                                                  [cols=\"30,70\",options=\"header\",width=\"100%\"] |=== | RDDCheckpointData | Description

                                                                                                                                                                  | rdd:LocalRDDCheckpointData.md[LocalRDDCheckpointData] | [[LocalRDDCheckpointData]]

                                                                                                                                                                  | rdd:ReliableRDDCheckpointData.md[ReliableRDDCheckpointData] | [[ReliableRDDCheckpointData]] Reliable Checkpointing

                                                                                                                                                                  |===

                                                                                                                                                                  == [[creating-instance]] Creating Instance

                                                                                                                                                                  RDDCheckpointData takes the following to be created:

                                                                                                                                                                  • [[rdd]] rdd:RDD.md[RDD]

                                                                                                                                                                  == [[Serializable]] RDDCheckpointData as Serializable

                                                                                                                                                                  RDDCheckpointData is java.io.Serializable.

                                                                                                                                                                  == [[cpState]] States

                                                                                                                                                                  • [[Initialized]] Initialized

                                                                                                                                                                  • [[CheckpointingInProgress]] CheckpointingInProgress

                                                                                                                                                                  • [[Checkpointed]] Checkpointed

                                                                                                                                                                  == [[checkpoint]] Checkpointing RDD

                                                                                                                                                                  "},{"location":"rdd/RDDCheckpointData/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/RDDCheckpointData/#checkpoint-checkpointrddt","title":"checkpoint(): CheckpointRDD[T]","text":"

                                                                                                                                                                  checkpoint changes the <> to <> only when in <> state. Otherwise, checkpoint does nothing and returns.

                                                                                                                                                                  checkpoint <> that gives an CheckpointRDD (that is the <> internal registry).

                                                                                                                                                                  checkpoint changes the <> to <>.

                                                                                                                                                                  In the end, checkpoint requests the given <> to rdd:RDD.md#markCheckpointed[markCheckpointed].

                                                                                                                                                                  checkpoint is used when RDD is requested to rdd:RDD.md#doCheckpoint[doCheckpoint].

                                                                                                                                                                  == [[doCheckpoint]] doCheckpoint Method

                                                                                                                                                                  "},{"location":"rdd/RDDCheckpointData/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/RDDCheckpointData/#docheckpoint-checkpointrddt","title":"doCheckpoint(): CheckpointRDD[T]","text":"

                                                                                                                                                                  doCheckpoint is used when RDDCheckpointData is requested to <>."},{"location":"rdd/RDDOperationScope/","title":"RDDOperationScope","text":""},{"location":"rdd/RDDOperationScope/#withScope","title":"withScope","text":"

                                                                                                                                                                  withScope[T](\n  sc: SparkContext,\n  name: String,\n  allowNesting: Boolean,\n  ignoreParent: Boolean)(\n  body: => T): T\nwithScope[T](\n  sc: SparkContext,\n  allowNesting: Boolean = false)(\n  body: => T): T\n
| name Argument Value | Caller |
|---|---|
| checkpoint | RDD.doCheckpoint |
| Some method name | Executed without name |
| The name of a physical operator (with no Exec suffix) | SparkPlan.executeQuery (Spark SQL) |

                                                                                                                                                                  withScope...FIXME

                                                                                                                                                                  withScope is used when:

                                                                                                                                                                  • RDD is requested to doCheckpoint and withScope (for most, if not all, RDD API operators)
                                                                                                                                                                  • SparkContext is requested to withScope (for most, if not all, SparkContext API operators)
                                                                                                                                                                  • SparkPlan (Spark SQL) is requested to executeQuery
                                                                                                                                                                  "},{"location":"rdd/RangePartitioner/","title":"RangePartitioner","text":"

                                                                                                                                                                  RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (that can be used for bucketed partitioning).

                                                                                                                                                                  RangePartitioner is used for sortByKey operator (mostly).
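A small spark-shell sketch: sortByKey builds a RangePartitioner behind the scenes, but one can also be created and used explicitly with partitionBy.

```scala
import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(Seq((7, "g"), (1, "a"), (4, "d"), (9, "i")), numSlices = 3)

val partitioner = new RangePartitioner(partitions = 2, rdd = pairs)
val ranged = pairs.partitionBy(partitioner)

// 2 (or fewer, if there are not enough distinct keys to compute 2 ranges)
println(ranged.getNumPartitions)
```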

                                                                                                                                                                  "},{"location":"rdd/RangePartitioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                  RangePartitioner takes the following to be created:

                                                                                                                                                                  • Hint for the number of partitions
                                                                                                                                                                  • Key-Value RDD (RDD[_ <: Product2[K, V]])
                                                                                                                                                                  • ascending flag (default: true)
                                                                                                                                                                  • samplePointsPerPartitionHint (default: 20)"},{"location":"rdd/RangePartitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                                    numPartitions: Int\n

numPartitions is part of the Partitioner abstraction.

                                                                                                                                                                    numPartitions is 1 more than the length of the range bounds (since the number of range bounds is 0 for 0 or 1 partitions).

                                                                                                                                                                    ","text":""},{"location":"rdd/RangePartitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                                    getPartition(\n  key: Any): Int\n

getPartition is part of the Partitioner abstraction.

                                                                                                                                                                    getPartition branches off based on the length of the range bounds.

For up to 128 range bounds, getPartition does a linear scan: starting from candidate partition 0, it walks over the rangeBounds and increments the candidate partition number as long as the key is greater than the current range bound. The scan stops at the first range bound that is greater than or equal to the key; if the key is greater than all of the rangeBounds, the candidate ends up being the number of range bounds (i.e. the last partition).

                                                                                                                                                                    For the number of the rangeBounds above 128, getPartition...FIXME

In the end, getPartition returns the candidate partition number when ascending is enabled, or flips it (to the number of rangeBounds minus the candidate partition number) otherwise.
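The following standalone sketch (not Spark's actual code) mimics the linear scan described above for the ascending case with at most 128 range bounds:

```scala
// Candidate partition = index of the first range bound >= key,
// or rangeBounds.length (the last partition) when the key exceeds all bounds.
def partitionForKey[K](key: K, rangeBounds: Array[K])(implicit ord: Ordering[K]): Int = {
  var partition = 0
  while (partition < rangeBounds.length && ord.gt(key, rangeBounds(partition))) {
    partition += 1
  }
  partition
}

partitionForKey(5, Array(3, 7, 10))   // 1
partitionForKey(42, Array(3, 7, 10))  // 3
```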

                                                                                                                                                                    ","text":""},{"location":"rdd/RangePartitioner/#range-bounds","title":"Range Bounds
                                                                                                                                                                    rangeBounds: Array[K]\n

                                                                                                                                                                    rangeBounds is an array of upper bounds.

For at most 1 partition, rangeBounds is an empty array.

For more than 1 partition, rangeBounds determines the sample size per partition. The total sample size is samplePointsPerPartitionHint multiplied by the number of partitions, capped at 1e6 keys. rangeBounds allows for 3x over-sampling per partition (to cope with unbalanced input partitions).
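A worked example of the arithmetic above (the variable names mirror the description, not necessarily Spark's internals):

```scala
val partitions = 8
val samplePointsPerPartitionHint = 20

// Total sample size, capped at 1e6 keys.
val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)  // 160.0

// 3x over-sampling per partition (to cope with unbalanced input partitions).
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / partitions).toInt         // 60
```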

                                                                                                                                                                    rangeBounds sketches the keys of the input rdd (with the sampleSizePerPartition).

                                                                                                                                                                    Note

                                                                                                                                                                    There is more going on in rangeBounds.

                                                                                                                                                                    In the end, rangeBounds determines the bounds.

                                                                                                                                                                    ","text":""},{"location":"rdd/RangePartitioner/#determinebounds","title":"determineBounds
                                                                                                                                                                    determineBounds[K: Ordering](\n  candidates: ArrayBuffer[(K, Float)],\n  partitions: Int): Array[K]\n

                                                                                                                                                                    determineBounds...FIXME

                                                                                                                                                                    ","text":""},{"location":"rdd/ReliableCheckpointRDD/","title":"ReliableCheckpointRDD","text":"

ReliableCheckpointRDD is a CheckpointRDD.

                                                                                                                                                                    "},{"location":"rdd/ReliableCheckpointRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                    ReliableCheckpointRDD takes the following to be created:

                                                                                                                                                                    • [[sc]] SparkContext.md[]
                                                                                                                                                                    • [[checkpointPath]] Checkpoint Directory (on a Hadoop DFS-compatible file system)
                                                                                                                                                                    • <<_partitioner, Partitioner>>

                                                                                                                                                                    ReliableCheckpointRDD is created when:

                                                                                                                                                                    • ReliableCheckpointRDD utility is used to <>.

                                                                                                                                                                    • SparkContext is requested to SparkContext.md#checkpointFile[checkpointFile]

== [[checkpointPartitionerFileName]] Checkpointed Partitioner File

                                                                                                                                                                      ReliableCheckpointRDD uses _partitioner as the name of the file in the <> with the <> serialized to.

                                                                                                                                                                      == [[partitioner]] Partitioner

                                                                                                                                                                      ReliableCheckpointRDD can be given a rdd:Partitioner.md[Partitioner] to be created.

                                                                                                                                                                      When rdd:RDD.md#partitioner[requested for the Partitioner] (as an RDD), ReliableCheckpointRDD returns the one it was created with or <>.

                                                                                                                                                                      == [[writeRDDToCheckpointDirectory]] Writing RDD to Checkpoint Directory

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#source-scala","title":"[source, scala]","text":"

writeRDDToCheckpointDirectory[T: ClassTag](originalRDD: RDD[T], checkpointDir: String, blockSize: Int = -1): ReliableCheckpointRDD[T]

                                                                                                                                                                      writeRDDToCheckpointDirectory...FIXME

                                                                                                                                                                      writeRDDToCheckpointDirectory is used when ReliableRDDCheckpointData is requested to rdd:ReliableRDDCheckpointData.md#doCheckpoint[doCheckpoint].

                                                                                                                                                                      == [[writePartitionerToCheckpointDir]] Writing Partitioner to Checkpoint Directory

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourcescala","title":"[source,scala]","text":"

                                                                                                                                                                      writePartitionerToCheckpointDir( sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit

                                                                                                                                                                      writePartitionerToCheckpointDir creates the <> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.

                                                                                                                                                                      writePartitionerToCheckpointDir requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].

writePartitionerToCheckpointDir requests the SerializerInstance to serializer:SerializerInstance.md#serializeStream[open a serialization stream] over the output stream and serializer:SerializationStream.md#writeObject[write] the given Partitioner to it.

                                                                                                                                                                      In the end, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#written-partitioner-to-partitionerfilepath","title":"Written partitioner to [partitionerFilePath]","text":"

                                                                                                                                                                      In case of any non-fatal exception, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#error-writing-partitioner-partitioner-to-checkpointdirpath","title":"Error writing partitioner [partitioner] to [checkpointDirPath]","text":"

                                                                                                                                                                      writePartitionerToCheckpointDir is used when ReliableCheckpointRDD is requested to <>.

                                                                                                                                                                      == [[readCheckpointedPartitionerFile]] Reading Partitioner from Checkpointed Directory

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourcescala_1","title":"[source,scala]","text":"

                                                                                                                                                                      readCheckpointedPartitionerFile( sc: SparkContext, checkpointDirPath: String): Option[Partitioner]

                                                                                                                                                                      readCheckpointedPartitionerFile opens the <> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.

                                                                                                                                                                      readCheckpointedPartitionerFile requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].

readCheckpointedPartitionerFile requests the SerializerInstance to serializer:SerializerInstance.md#deserializeStream[open a deserialization stream] over the input stream and serializer:DeserializationStream.md#readObject[read the Partitioner] from the partitioner file.

                                                                                                                                                                      readCheckpointedPartitionerFile prints out the following DEBUG message to the logs and returns the partitioner.

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_2","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#read-partitioner-from-partitionerfilepath","title":"Read partitioner from [partitionerFilePath]","text":"

                                                                                                                                                                      In case of FileNotFoundException or any non-fatal exceptions, readCheckpointedPartitionerFile prints out a corresponding message to the logs and returns None.

                                                                                                                                                                      readCheckpointedPartitionerFile is used when ReliableCheckpointRDD is requested for the <>.

                                                                                                                                                                      == [[logging]] Logging

                                                                                                                                                                      Enable ALL logging level for org.apache.spark.rdd.ReliableCheckpointRDD$ logger to see what happens inside.

                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                      "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_3","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#log4jloggerorgapachesparkrddreliablecheckpointrddall","title":"log4j.logger.org.apache.spark.rdd.ReliableCheckpointRDD$=ALL","text":"

                                                                                                                                                                      Refer to spark-logging.md[Logging].

                                                                                                                                                                      "},{"location":"rdd/ReliableRDDCheckpointData/","title":"ReliableRDDCheckpointData","text":"

                                                                                                                                                                      ReliableRDDCheckpointData is a RDDCheckpointData for Reliable Checkpointing.

                                                                                                                                                                      "},{"location":"rdd/ReliableRDDCheckpointData/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                      ReliableRDDCheckpointData takes the following to be created:

                                                                                                                                                                      • [[rdd]] rdd:RDD.md[++RDD[T]++]

                                                                                                                                                                      ReliableRDDCheckpointData is created for rdd:RDD.md#checkpoint[RDD.checkpoint] operator.

                                                                                                                                                                      == [[cpDir]][[checkpointPath]] Checkpoint Directory

                                                                                                                                                                      ReliableRDDCheckpointData creates a subdirectory of the SparkContext.md#checkpointDir[application-wide checkpoint directory] for <> the given <>.

                                                                                                                                                                      The name of the subdirectory uses the rdd:RDD.md#id[unique identifier] of the <>:"},{"location":"rdd/ReliableRDDCheckpointData/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#rdd-id","title":"rdd-[id]","text":"

                                                                                                                                                                      == [[doCheckpoint]] Checkpointing RDD

                                                                                                                                                                      "},{"location":"rdd/ReliableRDDCheckpointData/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#docheckpoint-checkpointrddt","title":"doCheckpoint(): CheckpointRDD[T]","text":"

                                                                                                                                                                      doCheckpoint rdd:ReliableCheckpointRDD.md#writeRDDToCheckpointDirectory[writes] the <> to the <> (that creates a new RDD).

                                                                                                                                                                      With configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled, doCheckpoint requests the SparkContext.md#cleaner[ContextCleaner] to core:ContextCleaner.md#registerRDDCheckpointDataForCleanup[registerRDDCheckpointDataForCleanup] for the new RDD.

                                                                                                                                                                      In the end, doCheckpoint prints out the following INFO message to the logs and returns the new RDD.

                                                                                                                                                                      "},{"location":"rdd/ReliableRDDCheckpointData/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#done-checkpointing-rdd-id-to-cpdir-new-parent-is-rdd-id","title":"Done checkpointing RDD [id] to [cpDir], new parent is RDD [id]","text":"

                                                                                                                                                                      doCheckpoint is part of the rdd:RDDCheckpointData.md#doCheckpoint[RDDCheckpointData] abstraction.

                                                                                                                                                                      "},{"location":"rdd/ShuffleDependency/","title":"ShuffleDependency","text":"

                                                                                                                                                                      ShuffleDependency is a Dependency on the output of a ShuffleMapStage of a key-value RDD.

ShuffleDependency uses the RDD to know the number of (map-side/pre-shuffle) partitions and the Partitioner for the number of (reduce-side/post-shuffle) partitions.

                                                                                                                                                                      ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]\n
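A quick spark-shell way to see a ShuffleDependency: reduceByKey produces a ShuffledRDD whose only dependency is a ShuffleDependency.

```scala
import org.apache.spark.ShuffleDependency

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map((_, 1))
  .reduceByKey(_ + _)   // ShuffledRDD

val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]

println(dep.shuffleId)                  // the shuffle id registered with the ShuffleManager
println(dep.partitioner.numPartitions)  // number of post-shuffle (reduce-side) partitions
```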
                                                                                                                                                                      ","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                      ShuffleDependency takes the following to be created:

                                                                                                                                                                      • RDD (RDD[_ <: Product2[K, V]])
                                                                                                                                                                      • Partitioner
                                                                                                                                                                      • Serializer (default: SparkEnv.get.serializer)
                                                                                                                                                                      • Optional Key Ordering (default: undefined)
                                                                                                                                                                      • Optional Aggregator
                                                                                                                                                                      • mapSideCombine
                                                                                                                                                                      • ShuffleWriteProcessor
ShuffleDependency is created when:

                                                                                                                                                                        • CoGroupedRDD is requested for the dependencies (for RDDs with different partitioners)
                                                                                                                                                                        • ShuffledRDD is requested for the dependencies
                                                                                                                                                                        • ShuffleExchangeExec (Spark SQL) physical operator is requested to prepare a ShuffleDependency

                                                                                                                                                                        When created, ShuffleDependency gets the shuffle id.

                                                                                                                                                                        ShuffleDependency registers itself with the ShuffleManager and gets a ShuffleHandle (available as shuffleHandle). ShuffleDependency uses SparkEnv to access the ShuffleManager.

                                                                                                                                                                        In the end, ShuffleDependency registers itself with the ContextCleaner (if configured) and the ShuffleDriverComponents.
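For a quick intuition, the spark-shell sketch below (not part of the original text) shows that a key-value transformation such as reduceByKey leaves a ShuffleDependency as the only dependency of the resulting RDD; the printed class name is illustrative and may vary across Spark versions.

val pairs = sc.parallelize(0 to 9).map(n => (n % 3, n))
val counts = pairs.reduceByKey(_ + _)

// The shuffled RDD has exactly one dependency: a ShuffleDependency
scala> counts.dependencies.foreach(d => println(d.getClass.getName))
org.apache.spark.ShuffleDependency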

                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#aggregator","title":"Aggregator
                                                                                                                                                                        aggregator: Option[Aggregator[K, V, C]]\n

                                                                                                                                                                        ShuffleDependency can be given a map/reduce-side Aggregator when created.

                                                                                                                                                                        ShuffleDependency asserts (when created) that an Aggregator is defined when the mapSideCombine flag is enabled.

aggregator is used when:

                                                                                                                                                                        • SortShuffleWriter is requested to write records (for mapper tasks)
                                                                                                                                                                        • BlockStoreShuffleReader is requested to read records (for reducer tasks)
                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#map-size-partial-aggregation-flag","title":"Map-Size Partial Aggregation Flag

                                                                                                                                                                        ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using the Aggregator.

                                                                                                                                                                        mapSideCombine is disabled (false) by default and can be enabled (true) for some uses of ShuffledRDD.

ShuffleDependency requires that the optional Aggregator is actually defined when the flag is enabled.

                                                                                                                                                                        mapSideCombine is used when:

                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                        • SortShuffleManager is requested to register a shuffle
                                                                                                                                                                        • SortShuffleWriter is requested to write records
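As a hedged illustration (assuming the standard behaviour of reduceByKey and groupByKey), the sketch below contrasts an operator that benefits from map-side combine with one that disables it:

val pairs = sc.parallelize(0 to 9).map(n => (n % 3, n))

// reduceByKey combines values per key on the map side first (mapSideCombine enabled),
// so only partial sums are shuffled across the network
val sums = pairs.reduceByKey(_ + _)

// groupByKey disables map-side combine since grouping does not shrink the data,
// so all values are shuffled as-is
val groups = pairs.groupByKey()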
                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#partitioner","title":"Partitioner

                                                                                                                                                                        ShuffleDependency is given a Partitioner (when created).

                                                                                                                                                                        ShuffleDependency uses the Partitioner to partition the shuffle output.

                                                                                                                                                                        The Partitioner is used when:

                                                                                                                                                                        • SortShuffleWriter is requested to write records (and create an ExternalSorter)
                                                                                                                                                                        • others (FIXME)
                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shufflewriteprocessor","title":"ShuffleWriteProcessor

                                                                                                                                                                        ShuffleDependency can be given a ShuffleWriteProcessor when created.

                                                                                                                                                                        The ShuffleWriteProcessor is used when:

                                                                                                                                                                        • ShuffleMapTask is requested to runTask (to write partition records out to the shuffle system)
                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shuffle-id","title":"Shuffle ID
                                                                                                                                                                        shuffleId: Int\n

                                                                                                                                                                        ShuffleDependency is identified uniquely by an application-wide shuffle ID (that is requested from SparkContext when created).

                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shufflehandle","title":"ShuffleHandle

                                                                                                                                                                        ShuffleDependency registers itself with the ShuffleManager when created.

                                                                                                                                                                        The ShuffleHandle is used when:

• CoGroupedRDD, ShuffledRDD, and ShuffledRowRDD (Spark SQL) are requested to compute a partition (to get a ShuffleReader for a ShuffleDependency)
                                                                                                                                                                        • ShuffleMapTask is requested to run (to get a ShuffleWriter for a ShuffleDependency).
                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/","title":"ShuffledRDD","text":"

ShuffledRDD is an RDD of key-value pairs that represents a shuffle step in an RDD lineage (and indicates the start of a new stage).

When requested to compute a partition, ShuffledRDD uses the ShuffleHandle of its one and only ShuffleDependency to request a ShuffleReader (from the system ShuffleManager) that is used to read the (combined) key-value pairs.

                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                        ShuffledRDD takes the following to be created:

                                                                                                                                                                        • RDD (of K keys and V values)
                                                                                                                                                                        • Partitioner
ShuffledRDD is created for the following RDD operators:

                                                                                                                                                                          • OrderedRDDFunctions.sortByKey and OrderedRDDFunctions.repartitionAndSortWithinPartitions

                                                                                                                                                                          • PairRDDFunctions.combineByKeyWithClassTag and PairRDDFunctions.partitionBy

                                                                                                                                                                          • RDD.coalesce (with shuffle flag enabled)

                                                                                                                                                                          ","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#partitioner","title":"Partitioner

                                                                                                                                                                          ShuffledRDD is given a Partitioner when created:

                                                                                                                                                                          • RangePartitioner for sortByKey
                                                                                                                                                                          • HashPartitioner for coalesce
                                                                                                                                                                          • Whatever passed in to the following high-level RDD operators when different from the current Partitioner (of the RDD):
                                                                                                                                                                            • repartitionAndSortWithinPartitions
                                                                                                                                                                            • combineByKeyWithClassTag
                                                                                                                                                                            • partitionBy

                                                                                                                                                                          The given Partitioner is the partitioner of this ShuffledRDD.

                                                                                                                                                                          The Partitioner is also used when:

                                                                                                                                                                          • getDependencies (to create the only ShuffleDependency)
                                                                                                                                                                          • getPartitions (to create as many ShuffledRDDPartitions as the numPartitions of the Partitioner)
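A minimal spark-shell sketch (the printed values are illustrative) of giving a ShuffledRDD its Partitioner through partitionBy:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(0 to 9).map(n => (n % 3, n))
val byKey = pairs.partitionBy(new HashPartitioner(4))  // a ShuffledRDD under the covers

scala> println(byKey.partitioner)
Some(org.apache.spark.HashPartitioner@4)

scala> println(byKey.getNumPartitions)
4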
                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#dependencies","title":"Dependencies Signature
                                                                                                                                                                          getDependencies: Seq[Dependency[_]]\n

                                                                                                                                                                          getDependencies is part of the RDD abstraction.

                                                                                                                                                                          getDependencies uses the user-specified Serializer, if defined, or requests the current SerializerManager for one.

                                                                                                                                                                          getDependencies uses the mapSideCombine internal flag for the types of the keys and values (i.e. K and C or K and V when the flag is enabled or not, respectively).

                                                                                                                                                                          In the end, getDependencies creates a single ShuffleDependency (with the previous RDD, the Partitioner, and the Serializer).

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#computing-partition","title":"Computing Partition Signature
                                                                                                                                                                          compute(\n  split: Partition,\n  context: TaskContext): Iterator[(K, C)]\n

                                                                                                                                                                          compute is part of the RDD abstraction.

                                                                                                                                                                          compute assumes that ShuffleDependency is the first dependency among the dependencies (and the only one per getDependencies).

                                                                                                                                                                          compute uses the SparkEnv to access the ShuffleManager. compute requests the ShuffleManager for the ShuffleReader based on the following:

• ShuffleHandle: the ShuffleHandle of the ShuffleDependency
• startPartition: the index of the given split partition
• endPartition: startPartition index + 1

                                                                                                                                                                          In the end, compute requests the ShuffleReader to read the (combined) key-value pairs (of type (K, C)).

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#key-value-and-combiner-types","title":"Key, Value and Combiner Types
                                                                                                                                                                          class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag]\n

                                                                                                                                                                          ShuffledRDD is given an RDD of K keys and V values to be created.

                                                                                                                                                                          When computed, ShuffledRDD produces pairs of K keys and C values.
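A small sketch (not from the original text) that makes the three type parameters concrete: K = Int and V = Int on input, and C = List[Int] after combining:

val pairs = sc.parallelize(0 to 9).map(n => (n % 3, n))   // RDD[(K = Int, V = Int)]

val combined = pairs.combineByKey(
  (v: Int) => List(v),                          // createCombiner: V => C
  (c: List[Int], v: Int) => v :: c,             // mergeValue: (C, V) => C
  (c1: List[Int], c2: List[Int]) => c1 ::: c2)  // mergeCombiners: (C, C) => C
// combined: RDD[(Int, List[Int])], i.e. RDD[(K, C)]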

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#isbarrier-flag","title":"isBarrier Flag

ShuffledRDD has the isBarrier flag always disabled (false).

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#map-side-combine-flag","title":"Map-Side Combine Flag

                                                                                                                                                                          ShuffledRDD uses a map-side combine flag to create a ShuffleDependency when requested for the dependencies (there is always only one).

The flag is disabled (false) by default and can be changed using the setMapSideCombine method.

                                                                                                                                                                          setMapSideCombine(\n  mapSideCombine: Boolean): ShuffledRDD[K, V, C]\n

setMapSideCombine is used for the PairRDDFunctions.combineByKeyWithClassTag transformation (which enables the flag by default).

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#placement-preferences-of-partition","title":"Placement Preferences of Partition Signature
                                                                                                                                                                          getPreferredLocations(\n  partition: Partition): Seq[String]\n

                                                                                                                                                                          getPreferredLocations is part of the RDD abstraction.

                                                                                                                                                                          getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the given partition (BlockManagers with the most map outputs).

                                                                                                                                                                          getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster.

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#shuffledrddpartition","title":"ShuffledRDDPartition

ShuffledRDDPartition is given an index when created (which is the partition index as determined by the Partitioner).

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#user-specified-serializer","title":"User-Specified Serializer

                                                                                                                                                                          User-specified Serializer for the single ShuffleDependency dependency

                                                                                                                                                                          userSpecifiedSerializer: Option[Serializer] = None\n

userSpecifiedSerializer is undefined (None) by default and can be changed using the setSerializer method (which is used for the PairRDDFunctions.combineByKeyWithClassTag transformation).
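A hedged sketch of using the DeveloperApi directly to pass a user-specified Serializer (KryoSerializer is used only as an example):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.serializer.KryoSerializer

val pairs = sc.parallelize(0 to 9).map(n => (n % 3, n))
val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(2))
  .setSerializer(new KryoSerializer(sc.getConf))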

                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#demos","title":"Demos","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#shuffledrdd-and-coalesce","title":"ShuffledRDD and coalesce
                                                                                                                                                                          val data = sc.parallelize(0 to 9)\nval coalesced = data.coalesce(numPartitions = 4, shuffle = true)\nscala> println(coalesced.toDebugString)\n(4) MapPartitionsRDD[9] at coalesce at <pastie>:75 []\n |  CoalescedRDD[8] at coalesce at <pastie>:75 []\n |  ShuffledRDD[7] at coalesce at <pastie>:75 []\n +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []\n    |   ParallelCollectionRDD[5] at parallelize at <pastie>:74 []\n
                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#shuffledrdd-and-sortbykey","title":"ShuffledRDD and sortByKey
// Continues the ShuffledRDD and coalesce demo above (reuses the coalesced RDD)\nval grouped = coalesced.groupBy(_ % 2)\nval sorted = grouped.sortByKey(numPartitions = 2)\nscala> println(sorted.toDebugString)\n(2) ShuffledRDD[15] at sortByKey at <console>:74 []\n +-(4) ShuffledRDD[12] at groupBy at <console>:74 []\n    +-(4) MapPartitionsRDD[11] at groupBy at <console>:74 []\n       |  MapPartitionsRDD[9] at coalesce at <pastie>:75 []\n       |  CoalescedRDD[8] at coalesce at <pastie>:75 []\n       |  ShuffledRDD[7] at coalesce at <pastie>:75 []\n       +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []\n          |   ParallelCollectionRDD[5] at parallelize at <pastie>:74 []\n
                                                                                                                                                                          ","text":"","tags":["DeveloperApi"]},{"location":"rdd/checkpointing/","title":"RDD Checkpointing","text":"

RDD Checkpointing is a process of truncating an RDD's lineage graph by saving the RDD's data to a reliable distributed (e.g. HDFS) or local file system.

                                                                                                                                                                          There are two types of checkpointing:

• Reliable Checkpointing - RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system (e.g. Hadoop DFS)
• Local Checkpointing - RDD checkpointing that saves the data to a local file system

It's up to a Spark application developer to decide when and how to checkpoint using the RDD.checkpoint() method.

Before checkpointing is used, a Spark developer has to set the checkpoint directory using the SparkContext.setCheckpointDir(directory: String) method.

                                                                                                                                                                            == [[reliable-checkpointing]] Reliable Checkpointing

You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory - the directory where RDDs are checkpointed. The directory must be an HDFS path if running on a cluster. The reason is that the driver may attempt to reconstruct the checkpointed RDD from its own local file system, which is incorrect because the checkpoint files are actually on the executor machines.

                                                                                                                                                                            You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.

NOTE: It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it to a file will require recomputation.

                                                                                                                                                                            When an action is called on a checkpointed RDD, the following INFO message is printed out in the logs:

                                                                                                                                                                            Done checkpointing RDD 5 to [path], new parent is RDD [id]\n

                                                                                                                                                                            == [[local-checkpointing]] Local Checkpointing

localCheckpoint allows truncating the RDD lineage graph while skipping the expensive step of replicating the materialized data to a reliable distributed file system.

                                                                                                                                                                            This is useful for RDDs with long lineages that need to be truncated periodically, e.g. GraphX.

                                                                                                                                                                            Local checkpointing trades fault-tolerance for performance.

                                                                                                                                                                            NOTE: The checkpoint directory set through SparkContext.setCheckpointDir is not used.
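A minimal local-checkpointing sketch (not part of the original demo); no checkpoint directory is required:

val nums = sc.parallelize(0 to 9).map(n => (n, n * n))
nums.localCheckpoint()   // mark for local checkpointing; no SparkContext.setCheckpointDir needed
nums.count()             // the first job materializes the data and truncates the lineage

scala> println(nums.toDebugString)   // the lineage now ends at the locally-checkpointed data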

                                                                                                                                                                            == [[demo]] Demo

                                                                                                                                                                            "},{"location":"rdd/checkpointing/#sourceplaintext","title":"[source,plaintext]","text":"

val rdd = sc.parallelize(0 to 9)

scala> rdd.checkpoint
org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
  at org.apache.spark.rdd.RDD.checkpoint(RDD.scala:1599)
  ... 49 elided

sc.setCheckpointDir(\"/tmp/rdd-checkpoint\")

// Creates a subdirectory for this SparkContext
$ ls /tmp/rdd-checkpoint/
fc21e1d1-3cd9-4d51-880f-58d1dd07f783

// Mark the RDD to checkpoint at the earliest action
rdd.checkpoint

scala> println(rdd.getCheckpointFile)
Some(file:/tmp/rdd-checkpoint/fc21e1d1-3cd9-4d51-880f-58d1dd07f783/rdd-2)

scala> println(rdd.id)
2

scala> println(rdd.getNumPartitions)
16

rdd.count

// Check out the checkpoint directory
// You should find a directory for the checkpointed RDD, e.g. rdd-2
// The number of part-000* files is exactly the number of partitions
$ ls -ltra /tmp/rdd-checkpoint/fc21e1d1-3cd9-4d51-880f-58d1dd07f783/rdd-2/part-000* | wc -l
16

                                                                                                                                                                            "},{"location":"rdd/lineage/","title":"RDD Lineage \u2014 Logical Execution Plan","text":"

                                                                                                                                                                            RDD Lineage (RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.

                                                                                                                                                                            RDD lineage is built as a result of applying transformations to an RDD and creates a so-called logical execution plan.

                                                                                                                                                                            Note

                                                                                                                                                                            The execution DAG or physical execution plan is the DAG of stages.

Such an RDD lineage graph could be the result of the following series of transformations:

                                                                                                                                                                            val r00 = sc.parallelize(0 to 9)\nval r01 = sc.parallelize(0 to 90 by 10)\nval r10 = r00.cartesian(r01)\nval r11 = r00.map(n => (n, n))\nval r12 = r00.zip(r01)\nval r13 = r01.keyBy(_ / 20)\nval r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)\n

An RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
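To inspect such a lineage yourself, print the debug string of the final RDD; the exact RDD ids and indentation depend on your session, so no output is shown here:

// r20 (the union) is printed at the top, with r10, r11, r12 and r13 (and their parents) indented below
scala> println(r20.toDebugString)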

                                                                                                                                                                            "},{"location":"rdd/lineage/#logical-execution-plan","title":"Logical Execution Plan","text":"

                                                                                                                                                                            Logical Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or reference cached data) and ends with the RDD that produces the result of the action that has been called to execute.

                                                                                                                                                                            Note

                                                                                                                                                                            A logical plan (a DAG) is materialized and executed when SparkContext is requested to run a Spark job.

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-actions/","title":"Actions","text":"

RDD Actions are RDD operations that produce concrete non-RDD values. They materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type but RDD[T] is an action.

                                                                                                                                                                            action: RDD => a value\n

NOTE: Actions are synchronous. You can use AsyncRDDActions to release a calling thread while calling actions.

They trigger execution of RDD transformations to return values. Simply put, an action evaluates the RDD lineage graph.

You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data.

                                                                                                                                                                            • aggregate
                                                                                                                                                                            • collect
                                                                                                                                                                            • count
                                                                                                                                                                            • countApprox*
                                                                                                                                                                            • countByValue*
                                                                                                                                                                            • first
                                                                                                                                                                            • fold
                                                                                                                                                                            • foreach
                                                                                                                                                                            • foreachPartition
                                                                                                                                                                            • max
                                                                                                                                                                            • min
                                                                                                                                                                            • reduce
                                                                                                                                                                            • saveAs* (e.g. saveAsTextFile, saveAsHadoopFile)
                                                                                                                                                                            • take
                                                                                                                                                                            • takeOrdered
                                                                                                                                                                            • takeSample
                                                                                                                                                                            • toLocalIterator
                                                                                                                                                                            • top
                                                                                                                                                                            • treeAggregate
                                                                                                                                                                            • treeReduce

Actions run jobs using SparkContext.runJob or directly via DAGScheduler.runJob.

scala> :type words\n\nscala> words.count\nres0: Long = 502\n

TIP: You should cache the RDDs you work with when you want to execute two or more actions on them for better performance. Refer to spark-rdd-caching.md[RDD Caching and Persistence].

                                                                                                                                                                            Before calling an action, Spark does closure/function cleaning (using SparkContext.clean) to make it ready for serialization and sending over the wire to executors. Cleaning can throw a SparkException if the computation cannot be cleaned.

                                                                                                                                                                            NOTE: Spark uses ClosureCleaner to clean closures.
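A hedged sketch of what goes wrong when a closure captures a non-serializable object (the class and value names are made up for illustration):

class NotSerializableHelper {       // does not extend Serializable
  def inc(n: Int): Int = n + 1
}
val helper = new NotSerializableHelper

// The closure below captures `helper`, so cleaning/serialization fails with
// org.apache.spark.SparkException: Task not serializable
// sc.parallelize(0 to 9).map(helper.inc).count()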

                                                                                                                                                                            === [[AsyncRDDActions]] AsyncRDDActions

The AsyncRDDActions class offers asynchronous actions that you can use on RDDs (thanks to the implicit conversion rddToAsyncRDDActions in the RDD class). The methods return a FutureAction.

                                                                                                                                                                            The following asynchronous methods are available:

                                                                                                                                                                            • countAsync
                                                                                                                                                                            • collectAsync
                                                                                                                                                                            • takeAsync
                                                                                                                                                                            • foreachAsync
                                                                                                                                                                            • foreachPartitionAsync
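A minimal usage sketch (assuming a spark-shell session): countAsync returns a FutureAction that does not block the calling thread.

import scala.concurrent.Await
import scala.concurrent.duration._

val nums = sc.parallelize(0 to 9)
val futureCount = nums.countAsync()   // FutureAction[Long]

// ...do other work on the calling thread, then wait for (or register a callback on) the result
val n = Await.result(futureCount, 1.minute)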
                                                                                                                                                                            "},{"location":"rdd/spark-rdd-caching/","title":"Caching and Persistence","text":"

                                                                                                                                                                            == RDD Caching and Persistence

Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are thus kept in memory (the default) or on more solid storage like disk, and/or replicated.

RDDs can be cached using the cache operation. They can also be persisted using the persist operation.

                                                                                                                                                                            The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.

                                                                                                                                                                            NOTE: Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably and I will follow the \"pattern\" here.

RDDs can also be unpersisted to remove an RDD from a permanent storage like memory and/or disk.

                                                                                                                                                                            === [[cache]] Caching RDD -- cache Method

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-caching/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-caching/#cache-thistype-persist","title":"cache(): this.type = persist()","text":"

cache is a synonym of persist with the storage:StorageLevel.md[MEMORY_ONLY storage level].

                                                                                                                                                                            === [[persist]] Persisting RDD -- persist Methods

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-caching/#source-scala_1","title":"[source, scala]","text":"

persist(): this.type
persist(newLevel: StorageLevel): this.type

persist marks an RDD for persistence using the newLevel storage:StorageLevel.md[storage level].

You can only set the storage level once, or persist reports an UnsupportedOperationException:

                                                                                                                                                                            Cannot change storage level of an RDD after it was already assigned a level\n

NOTE: You can "change" the storage level of an RDD with an already-assigned storage level only to the very same level it currently has.

                                                                                                                                                                            If the RDD is marked as persistent the first time, the RDD is core:ContextCleaner.md#registerRDDForCleanup[registered to ContextCleaner] (if available) and SparkContext.md#persistRDD[SparkContext].

                                                                                                                                                                            The internal storageLevel attribute is set to the input newLevel storage level.
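A short sketch of persisting an RDD (the storage level and names are illustrative):

import org.apache.spark.storage.StorageLevel

val squares = sc.parallelize(0 to 9).map(n => (n, n * n))
squares.persist(StorageLevel.MEMORY_AND_DISK)   // or squares.cache() for MEMORY_ONLY
squares.count()                                 // the first action computes and stores the partitions

// Assigning a different storage level afterwards throws UnsupportedOperationException
// squares.persist(StorageLevel.DISK_ONLY)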

                                                                                                                                                                            === [[unpersist]] Unpersisting RDDs (Clearing Blocks) -- unpersist Method

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-caching/#source-scala_2","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-caching/#unpersistblocking-boolean-true-thistype","title":"unpersist(blocking: Boolean = true): this.type","text":"

                                                                                                                                                                            When called, unpersist prints the following INFO message to the logs:

                                                                                                                                                                            INFO [RddName]: Removing RDD [id] from persistence list\n

                                                                                                                                                                            It then calls SparkContext.md#unpersist[SparkContext.unpersistRDD(id, blocking)] and sets storage:StorageLevel.md[NONE storage level] as the current storage level.
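Continuing the persist sketch above, unpersisting the same RDD could look as follows (the blocking flag is passed explicitly here):

squares.unpersist(blocking = true)   // removes the blocks and resets the storage level to NONE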

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-operations/","title":"Operators","text":"

                                                                                                                                                                            == Operators - Transformations and Actions

                                                                                                                                                                            RDDs have two types of operations: spark-rdd-transformations.md[transformations] and spark-rdd-actions.md[actions].

                                                                                                                                                                            NOTE: Operators are also called operations.

                                                                                                                                                                            === Gotchas - things to watch for

Even if you don't access SparkContext explicitly, it cannot be referenced inside a closure, as closures are serialized and shipped to executors (and SparkContext is not serializable).

                                                                                                                                                                            See https://issues.apache.org/jira/browse/SPARK-5063
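A hedged sketch of the kind of code SPARK-5063 rejects; the offending line is left commented out:

val rdd1 = sc.parallelize(0 to 9)
val rdd2 = sc.parallelize(10 to 19)

// Referencing rdd2 (or the SparkContext) inside rdd1's transformation fails at runtime,
// because RDD transformations and actions can only be invoked by the driver,
// not inside of other transformations
// rdd1.map(n => rdd2.count() + n).collect()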

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-partitions/","title":"Partitions and Partitioning","text":"

                                                                                                                                                                            == Partitions and Partitioning

                                                                                                                                                                            === Introduction

Depending on how you look at Spark (as a programmer, devops engineer, or admin), an RDD is about the content (the developer's and data scientist's perspective) or about how it gets spread out over a cluster (performance), i.e. how many partitions an RDD represents.

                                                                                                                                                                            A partition (aka split) is a logical chunk of a large distributed data set.

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-partitions/#caution","title":"[CAUTION]","text":"

                                                                                                                                                                            FIXME

                                                                                                                                                                            1. How does the number of partitions map to the number of tasks? How to verify it?

Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.

                                                                                                                                                                            By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.

There is a one-to-one correspondence between Spark partitions and how data is laid out in data storage like HDFS or Cassandra (which is partitioned for the same reasons).

                                                                                                                                                                            Features:

                                                                                                                                                                            • size
                                                                                                                                                                            • number
                                                                                                                                                                            • partitioning scheme
                                                                                                                                                                            • node distribution
                                                                                                                                                                            • repartitioning
                                                                                                                                                                            "},{"location":"rdd/spark-rdd-partitions/#how-does-the-mapping-between-partitions-and-tasks-correspond-to-data-locality-if-any","title":"How does the mapping between partitions and tasks correspond to data locality if any?","text":""},{"location":"rdd/spark-rdd-partitions/#tip","title":"[TIP]","text":"

                                                                                                                                                                            Read the following documentations to learn what experts say on the topic:

                                                                                                                                                                            • https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html[How Many Partitions Does An RDD Have?]

By default, a partition is created for each HDFS block, which is 64MB by default (from http://spark.apache.org/docs/latest/programming-guide.html#external-datasets[Spark's Programming Guide]).

                                                                                                                                                                            RDDs get partitioned automatically without programmer intervention. However, there are times when you'd like to adjust the size and number of partitions or the partitioning scheme according to the needs of your application.

You use the def getPartitions: Array[Partition] method of an RDD to know the set of partitions this RDD consists of.

                                                                                                                                                                            As noted in https://github.com/databricks/spark-knowledgebase/blob/master/performance_optimization/how_many_partitions_does_an_rdd_have.md#view-task-execution-against-partitions-using-the-ui[View Task Execution Against Partitions Using the UI]:

                                                                                                                                                                            When a stage executes, you can see the number of partitions for a given stage in the Spark UI.

                                                                                                                                                                            Start spark-shell and see it yourself!

                                                                                                                                                                            scala> sc.parallelize(1 to 100).count\nres0: Long = 100\n

                                                                                                                                                                            When you execute the Spark job, i.e. sc.parallelize(1 to 100).count, you should see the following in http://localhost:4040/jobs[Spark shell application UI].

.The number of partitions as Total tasks in UI image::spark-partitions-ui-stages.png[align=\"center\"]

The reason for 8 Tasks in Total is that I'm on an 8-core laptop and, by default, the number of partitions is the number of all available cores.

                                                                                                                                                                            $ sysctl -n hw.ncpu\n8\n

You can request the minimum number of partitions using the second input parameter of many transformations.

                                                                                                                                                                            scala> sc.parallelize(1 to 100, 2).count\nres1: Long = 100\n

                                                                                                                                                                            .Total tasks in UI shows 2 partitions image::spark-partitions-ui-stages-2-partitions.png[align=\"center\"]

You can always ask for the number of partitions using the partitions method of an RDD:

                                                                                                                                                                            scala> val ints = sc.parallelize(1 to 100, 4)\nints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24\n\nscala> ints.partitions.size\nres2: Int = 4\n

In general, smaller/more numerous partitions allow work to be distributed among more workers, while larger/fewer partitions allow work to be done in larger chunks, which may get the work done more quickly (thanks to reduced overhead) as long as all workers are kept busy.

Increasing the number of partitions makes each partition hold less data (or possibly none at all!)

Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism[2-3x that]).

                                                                                                                                                                            As far as choosing a \"good\" number of partitions, you generally want at least as many as the number of executors for parallelism. You can get this computed value by calling sc.defaultParallelism.
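For example (a sketch; the 3x multiplier is just the rule of thumb quoted above), you can size an RDD relative to sc.defaultParallelism:

[source, scala]

val targetPartitions = sc.defaultParallelism * 3          // 2-3x the parallelism, per the tuning guide
val data = sc.parallelize(1 to 1000000, targetPartitions)
data.partitions.length                                    // == targetPartitions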

                                                                                                                                                                            Also, the number of partitions determines how many files get generated by actions that save RDDs to files.
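For instance, a minimal sketch (the output path is hypothetical): saving a 4-partition RDD as text produces one part-XXXXX file per partition.

[source, scala]

// writes part-00000 ... part-00003 (plus _SUCCESS) under the hypothetical /tmp/numbers directory
sc.parallelize(1 to 100, 4).saveAsTextFile("/tmp/numbers")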

                                                                                                                                                                            The maximum size of a partition is ultimately limited by the available memory of an executor.

In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partitions), the partitions parameter determines the number of partitions used for all further transformations and actions on this RDD.

Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may cause a shuffle in some situations, but it is not guaranteed to happen in all cases, and it usually occurs when an action is executed.

                                                                                                                                                                            When creating an RDD by reading a file using rdd = SparkContext().textFile(\"hdfs://.../file.txt\") the number of partitions may be smaller. Ideally, you would get the same number of blocks as you see in HDFS, but if the lines in your file are too long (longer than the block size), there will be fewer partitions.

The preferred way to set up the number of partitions for an RDD is to pass it directly as the second input parameter in the call, like rdd = sc.textFile(\"hdfs://.../file.txt\", 400), where 400 is the number of partitions. In this case, the partitioning into 400 splits is done by Hadoop's TextInputFormat, not Spark, and it works much faster. The code also spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.

                                                                                                                                                                            It will only work as described for uncompressed files.

When using textFile with compressed files (file.txt.gz, not file.txt or similar), Spark disables splitting, which makes for an RDD with only 1 partition (as reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should do repartitioning (described below).

Some operations, e.g. map and flatMap, don't preserve partitioning (the partitioner of the parent RDD is dropped, since keys may change), while filter does preserve it.

map, flatMap and filter apply a function to every element within each partition and do not move data between partitions.
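A minimal spark-shell sketch of the difference (the HashPartitioner and the key function are arbitrary): map drops the partitioner, while filter and mapValues keep it.

[source, scala]

import org.apache.spark.HashPartitioner

val byKey = sc.parallelize(1 to 100).map(n => (n % 10, n)).partitionBy(new HashPartitioner(4))
byKey.partitioner                     // Some(HashPartitioner)
byKey.map(identity).partitioner       // None -- keys could have changed, so the partitioner is dropped
byKey.filter(_._2 > 50).partitioner   // Some(HashPartitioner) -- filter preserves partitioning
byKey.mapValues(_ + 1).partitioner    // Some(HashPartitioner) -- values change, keys do not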

                                                                                                                                                                            === [[repartitioning]][[repartition]] Repartitioning RDD -- repartition Transformation

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-partitions/#httpssparkapacheorgdocslatesttuninghtmltuning-spark-the-official-documentation-of-spark","title":"https://spark.apache.org/docs/latest/tuning.html[Tuning Spark] (the official documentation of Spark)","text":""},{"location":"rdd/spark-rdd-partitions/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-partitions/#repartitionnumpartitions-intimplicit-ord-orderingt-null-rddt","title":"repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]","text":"

repartition is coalesce (described below) with numPartitions and shuffle enabled.
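In other words, the two calls below produce the same result (a sketch, assuming an existing RDD named lines as in the example that follows):

[source, scala]

lines.repartition(5)
lines.coalesce(numPartitions = 5, shuffle = true)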

                                                                                                                                                                            With the following computation you can see that repartition(5) causes 5 tasks to be started using NODE_LOCAL data locality.

                                                                                                                                                                            scala> lines.repartition(5).count\n...\n15/10/07 08:10:00 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (MapPartitionsRDD[19] at repartition at <console>:27)\n15/10/07 08:10:00 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 17, localhost, partition 0,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 18, localhost, partition 1,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 19, localhost, partition 2,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 20, localhost, partition 3,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 21, localhost, partition 4,NODE_LOCAL, 2089 bytes)\n...\n

You can see the change after executing repartition(1): it causes 2 tasks to be started using PROCESS_LOCAL data locality.

                                                                                                                                                                            scala> lines.repartition(1).count\n...\n15/10/07 08:14:09 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[20] at repartition at <console>:27)\n15/10/07 08:14:09 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks\n15/10/07 08:14:09 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 22, localhost, partition 0,PROCESS_LOCAL, 2058 bytes)\n15/10/07 08:14:09 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 23, localhost, partition 1,PROCESS_LOCAL, 2058 bytes)\n...\n

                                                                                                                                                                            Please note that Spark disables splitting for compressed files and creates RDDs with only 1 partition. In such cases, it's helpful to use sc.textFile('demo.gz') and do repartitioning using rdd.repartition(100) as follows:

                                                                                                                                                                            rdd = sc.textFile('demo.gz')\nrdd = rdd.repartition(100)\n

With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.

• rdd.repartition(N) does a shuffle to split data to match N
  • partitioning is done on a round-robin basis

                                                                                                                                                                            TIP: If partitioning scheme doesn't work for you, you can write your own custom partitioner.
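A minimal sketch of such a custom partitioner (the class name and the even/odd scheme are made up for illustration):

[source, scala]

import org.apache.spark.Partitioner

// sends even keys to partition 0 and odd keys to partition 1
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

val byParity = sc.parallelize(1 to 10).map(n => (n, n)).partitionBy(new EvenOddPartitioner)
byParity.glom().collect()  // two arrays: one with even keys, one with odd keys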

                                                                                                                                                                            TIP: It's useful to get familiar with https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html[Hadoop's TextInputFormat].

                                                                                                                                                                            === [[coalesce]] coalesce Transformation

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-partitions/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-partitions/#coalescenumpartitions-int-shuffle-boolean-falseimplicit-ord-orderingt-null-rddt","title":"coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]","text":"

                                                                                                                                                                            The coalesce transformation is used to change the number of partitions. It can trigger shuffling depending on the shuffle flag (disabled by default, i.e. false).

In the following sample, you parallelize a local sequence of numbers (0 to 10) and coalesce it first without and then with shuffling (note the shuffle parameter being false and true, respectively).

                                                                                                                                                                            Tip

                                                                                                                                                                            Use toDebugString to check out the RDD lineage graph.

                                                                                                                                                                            scala> val rdd = sc.parallelize(0 to 10, 8)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24\n\nscala> rdd.partitions.size\nres0: Int = 8\n\nscala> rdd.coalesce(numPartitions=8, shuffle=false)   // <1>\nres1: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[1] at coalesce at <console>:27\n\nscala> res1.toDebugString\nres2: String =\n(8) CoalescedRDD[1] at coalesce at <console>:27 []\n |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n\nscala> rdd.coalesce(numPartitions=8, shuffle=true)\nres3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at coalesce at <console>:27\n\nscala> res3.toDebugString\nres4: String =\n(8) MapPartitionsRDD[5] at coalesce at <console>:27 []\n |  CoalescedRDD[4] at coalesce at <console>:27 []\n |  ShuffledRDD[3] at coalesce at <console>:27 []\n +-(8) MapPartitionsRDD[2] at coalesce at <console>:27 []\n    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n
<1> shuffle is false by default and is used explicitly here for demo purposes. Note that the number of partitions remains the same as the number of partitions in the source RDD rdd.

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/","title":"Transformations -- Lazy Operations on RDD (to Create One or More RDDs)","text":"

                                                                                                                                                                            Transformations are lazy operations on an rdd:RDD.md[RDD] that create one or many new RDDs.

                                                                                                                                                                            // T and U are Scala types\ntransformation: RDD[T] => RDD[U]\ntransformation: RDD[T] => Seq[RDD[U]]\n

                                                                                                                                                                            In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output. Transformations do not change the input RDD (since rdd:index.md#introduction[RDDs are immutable] and hence cannot be modified), but produce one or more new RDDs by applying the computations they represent.
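For example (a sketch), applying a transformation gives you a new RDD and leaves the parent untouched:

[source, scala]

val numbers = sc.parallelize(1 to 5)   // RDD[Int]
val doubled = numbers.map(_ * 2)       // a new RDD[Int]; numbers is not modified
numbers.collect()                      // Array(1, 2, 3, 4, 5)
doubled.collect()                      // Array(2, 4, 6, 8, 10)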

                                                                                                                                                                            [[methods]] .(Subset of) RDD Transformations (Public API) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                            | aggregate a| [[aggregate]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala","title":"[source, scala]","text":"

aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

                                                                                                                                                                            | barrier a| [[barrier]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#barrier-rddbarriert","title":"barrier(): RDDBarrier[T]","text":"

(New in 2.4.0) Marks the current stage as a barrier stage in Barrier Execution Mode, where Spark must launch all tasks together

Internally, barrier creates an RDDBarrier over the RDD

                                                                                                                                                                            | cache a| [[cache]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_2","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#cache-thistype","title":"cache(): this.type","text":"

                                                                                                                                                                            Persists the RDD with the storage:StorageLevel.md#MEMORY_ONLY[MEMORY_ONLY] storage level

Synonym of persist

                                                                                                                                                                            | coalesce a| [[coalesce]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                            coalesce( numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null): RDD[T]

                                                                                                                                                                            | filter a| [[filter]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_4","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#filterf-t-boolean-rddt","title":"filter(f: T => Boolean): RDD[T]","text":"

                                                                                                                                                                            | flatMap a| [[flatMap]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_5","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#flatmapu-rddu","title":"flatMapU: RDD[U]","text":"

                                                                                                                                                                            | map a| [[map]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_6","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#mapu-rddu","title":"mapU: RDD[U]","text":"

                                                                                                                                                                            | mapPartitions a| [[mapPartitions]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_7","title":"[source, scala]","text":"

mapPartitions[U](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

                                                                                                                                                                            | mapPartitionsWithIndex a| [[mapPartitionsWithIndex]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_8","title":"[source, scala]","text":"

mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

                                                                                                                                                                            | randomSplit a| [[randomSplit]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_9","title":"[source, scala]","text":"

                                                                                                                                                                            randomSplit( weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

                                                                                                                                                                            | union a| [[union]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_10","title":"[source, scala]","text":"

++(other: RDD[T]): RDD[T]
union(other: RDD[T]): RDD[T]

                                                                                                                                                                            | persist a| [[persist]]

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#source-scala_11","title":"[source, scala]","text":"

persist(): this.type
persist(newLevel: StorageLevel): this.type

                                                                                                                                                                            |===

By applying transformations you incrementally build an RDD lineage with all the parent RDDs of the final RDD(s).

Transformations are lazy, i.e. they are not executed immediately. Only after calling an action are transformations executed.

After executing a transformation, the result RDD(s) will always be different from their parents and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).

CAUTION: There are transformations that may trigger jobs, e.g. sortBy, zipWithIndex, etc.

                                                                                                                                                                            .From SparkContext by transformations to the result image::rdd-sparkcontext-transformations-action.png[align=\"center\"]

                                                                                                                                                                            Certain transformations can be pipelined which is an optimization that Spark uses to improve performance of computations.

                                                                                                                                                                            "},{"location":"rdd/spark-rdd-transformations/#sourcescala","title":"[source,scala]","text":"

scala> val file = sc.textFile(\"README.md\")
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:24

scala> val allWords = file.flatMap(_.split(\"\\W+\"))
allWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[55] at flatMap at <console>:26

scala> val words = allWords.filter(!_.isEmpty)
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at filter at <console>:28

scala> val pairs = words.map((_,1))
pairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[57] at map at <console>:30

scala> val reducedByKey = pairs.reduceByKey(_ + _)
reducedByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[59] at reduceByKey at <console>:32

scala> val top10words = reducedByKey.takeOrdered(10)(Ordering[Int].reverse.on(_._2))
INFO SparkContext: Starting job: takeOrdered at <console>:34
...
INFO DAGScheduler: Job 18 finished: takeOrdered at <console>:34, took 0.074386 s
top10words: Array[(String, Int)] = Array((the,21), (to,14), (Spark,13), (for,11), (and,10), (##,8), (a,8), (run,7), (can,6), (is,6))

                                                                                                                                                                            There are two kinds of transformations:

• narrow transformations
• wide transformations

                                                                                                                                                                              === [[narrow-transformations]] Narrow Transformations

Narrow transformations are the result of map, filter and similar operations where the data required to compute the records in a partition comes from a single partition of the parent RDD only, i.e. it is self-sustained.

An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.

Spark groups narrow transformations into a single stage, which is called pipelining.
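A minimal spark-shell sketch: a chain of narrow transformations shows up as a single stage (no shuffle boundary) in the RDD lineage.

[source, scala]

val narrow = sc.parallelize(1 to 100, 4).map(_ * 2).filter(_ % 3 == 0)
narrow.toDebugString
// (4) MapPartitionsRDD[...] at filter ...
//  |  MapPartitionsRDD[...] at map ...
//  |  ParallelCollectionRDD[...] at parallelize ...   <- one stage, no ShuffledRDD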

                                                                                                                                                                              === [[wide-transformations]] Wide Transformations

                                                                                                                                                                              Wide transformations are the result of groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD.

NOTE: Wide transformations are also called shuffle transformations as they usually (though not always) require a shuffle.

                                                                                                                                                                              All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute a RDD shuffle, which transfers data across cluster and results in a new stage with a new set of partitions.
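A matching sketch for a wide transformation: reduceByKey introduces a ShuffledRDD and hence a stage boundary in the lineage.

[source, scala]

val wide = sc.parallelize(1 to 100, 4).map(n => (n % 10, n)).reduceByKey(_ + _)
wide.toDebugString
// (4) ShuffledRDD[...] at reduceByKey ...
//  +-(4) MapPartitionsRDD[...] at map ...             <- the +- marks the shuffle (stage) boundary
//     |  ParallelCollectionRDD[...] at parallelize ...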

                                                                                                                                                                              "},{"location":"rdd/spark-rdd-transformations/#zipwithindex","title":"zipWithIndex","text":""},{"location":"rdd/spark-rdd-transformations/#source-scala_12","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#zipwithindex-rddt-long","title":"zipWithIndex(): RDD[(T, Long)]

                                                                                                                                                                              zipWithIndex zips this RDD[T] with its element indices.
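For example (a sketch):

[source, scala]

sc.parallelize(Seq("a", "b", "c"), 2).zipWithIndex().collect()
// Array((a,0), (b,1), (c,2))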

                                                                                                                                                                              ","text":""},{"location":"rdd/spark-rdd-transformations/#caution","title":"[CAUTION]","text":"

If the number of partitions of the source RDD is greater than 1, zipWithIndex submits an additional job to calculate the start indices.

                                                                                                                                                                              "},{"location":"rdd/spark-rdd-transformations/#source-scala_13","title":"[source, scala]

                                                                                                                                                                              val onePartition = sc.parallelize(0 to 9, 1)

scala> onePartition.partitions.length
res0: Int = 1

// no job submitted
onePartition.zipWithIndex

val eightPartitions = sc.parallelize(0 to 9, 8)

scala> eightPartitions.partitions.length
res1: Int = 8

// submits a job
eightPartitions.zipWithIndex

                                                                                                                                                                              .Spark job submitted by zipWithIndex transformation image::spark-transformations-zipWithIndex-webui.png[align=\"center\"] ====

                                                                                                                                                                              ","text":""},{"location":"rest/","title":"Index","text":"

                                                                                                                                                                              = Status REST API -- Monitoring Spark Applications Using REST API

                                                                                                                                                                              Status REST API is a collection of REST endpoints under /api/v1 URI path in the spark-api-UIRoot.md[root containers for application UI information]:

                                                                                                                                                                              • [[SparkUI]] spark-webui-SparkUI.md[SparkUI] - Application UI for an active Spark application (i.e. a Spark application that is still running)

                                                                                                                                                                              • [[HistoryServer]] spark-history-server:HistoryServer.md[HistoryServer] - Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)

Status REST API uses the spark-api-ApiRootResource.md[ApiRootResource] main resource class that registers the /api/v1 URI paths.

                                                                                                                                                                              [[paths]] .URI Paths [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Path | Description

                                                                                                                                                                              | [[applications]] applications | [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

                                                                                                                                                                              | [[applications_appId]] applications/\\{appId} | [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

                                                                                                                                                                              | [[version]] version | Creates a VersionInfo with the current version of Spark |===
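As a minimal sketch (assuming a running Spark application whose web UI listens on localhost:4040), you can hit the version endpoint straight from spark-shell:

[source, scala]

import scala.io.Source
// GET /api/v1/version of the local application UI
val versionJson = Source.fromURL("http://localhost:4040/api/v1/version").mkString
println(versionJson)   // e.g. { "spark" : "3.5.0" }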

                                                                                                                                                                              Status REST API uses the following components:

                                                                                                                                                                              • https://jersey.github.io/[Jersey RESTful Web Services framework] with support for the https://github.com/jax-rs[Java API for RESTful Web Services] (JAX-RS API)

                                                                                                                                                                              • https://www.eclipse.org/jetty/[Eclipse Jetty] as the lightweight HTTP server and the https://jcp.org/en/jsr/detail?id=369[Java Servlet] container

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/","title":"AbstractApplicationResource","text":"

                                                                                                                                                                              == [[AbstractApplicationResource]] AbstractApplicationResource

AbstractApplicationResource is a spark-api-BaseAppResource.md[BaseAppResource] with a set of URI paths that are common across its implementations.

                                                                                                                                                                              // start spark-shell\n$ http http://localhost:4040/api/v1/applications\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 257\nContent-Type: application/json\nDate: Tue, 05 Jun 2018 18:46:32 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[\n    {\n        \"attempts\": [\n            {\n                \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n                \"completed\": false,\n                \"duration\": 0,\n                \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n                \"endTimeEpoch\": -1,\n                \"lastUpdated\": \"2018-06-05T15:04:48.328GMT\",\n                \"lastUpdatedEpoch\": 1528211088328,\n                \"sparkUser\": \"jacek\",\n                \"startTime\": \"2018-06-05T15:04:48.328GMT\",\n                \"startTimeEpoch\": 1528211088328\n            }\n        ],\n        \"id\": \"local-1528211089216\",\n        \"name\": \"Spark shell\"\n    }\n]\n\n$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd\nHTTP/1.1 200 OK\nContent-Length: 3\nContent-Type: application/json\nDate: Tue, 05 Jun 2018 18:48:00 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[]\n\n// Execute the following query in spark-shell\nspark.range(5).cache.count\n\n$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd\n// output omitted for brevity\n

                                                                                                                                                                              [[implementations]] .AbstractApplicationResources [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | AbstractApplicationResource | Description

                                                                                                                                                                              | spark-api-OneApplicationResource.md[OneApplicationResource] | [[OneApplicationResource]] Handles applications/appId requests

                                                                                                                                                                              | spark-api-OneApplicationAttemptResource.md[OneApplicationAttemptResource] | [[OneApplicationAttemptResource]] |===

                                                                                                                                                                              [[paths]] .AbstractApplicationResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

| allexecutors | GET | allExecutorList

| environment | GET | environmentInfo

| executors | GET | executorList

| jobs | GET | jobsList

| jobs/{jobId: \\\\d+} | GET | oneJob

| logs | GET | getEventLogs

| stages | | Delegates to the StagesResource resource class

| storage/rdd/{rddId: \\\\d+} | GET | rddData

| [[storage_rdd]] storage/rdd | GET | rddList |===

                                                                                                                                                                              === [[rddList]] rddList Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#rddlist-seqrddstorageinfo","title":"rddList(): Seq[RDDStorageInfo]","text":"

                                                                                                                                                                              rddList...FIXME

                                                                                                                                                                              NOTE: rddList is used when...FIXME

                                                                                                                                                                              === [[environmentInfo]] environmentInfo Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_1","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#environmentinfo-applicationenvironmentinfo","title":"environmentInfo(): ApplicationEnvironmentInfo","text":"

                                                                                                                                                                              environmentInfo...FIXME

                                                                                                                                                                              NOTE: environmentInfo is used when...FIXME

                                                                                                                                                                              === [[rddData]] rddData Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_2","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#rdddatapathparamrddid-rddid-int-rddstorageinfo","title":"rddData(@PathParam(\"rddId\") rddId: Int): RDDStorageInfo","text":"

                                                                                                                                                                              rddData...FIXME

                                                                                                                                                                              NOTE: rddData is used when...FIXME

                                                                                                                                                                              === [[allExecutorList]] allExecutorList Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_3","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#allexecutorlist-seqexecutorsummary","title":"allExecutorList(): Seq[ExecutorSummary]","text":"

                                                                                                                                                                              allExecutorList...FIXME

                                                                                                                                                                              NOTE: allExecutorList is used when...FIXME

                                                                                                                                                                              === [[executorList]] executorList Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_4","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#executorlist-seqexecutorsummary","title":"executorList(): Seq[ExecutorSummary]","text":"

                                                                                                                                                                              executorList...FIXME

                                                                                                                                                                              NOTE: executorList is used when...FIXME

                                                                                                                                                                              === [[oneJob]] oneJob Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_5","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#onejobpathparamjobid-jobid-int-jobdata","title":"oneJob(@PathParam(\"jobId\") jobId: Int): JobData","text":"

                                                                                                                                                                              oneJob...FIXME

                                                                                                                                                                              NOTE: oneJob is used when...FIXME

                                                                                                                                                                              === [[jobsList]] jobsList Method

                                                                                                                                                                              "},{"location":"rest/AbstractApplicationResource/#source-scala_6","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#jobslistqueryparamstatus-statuses-jlistjobexecutionstatus-seqjobdata","title":"jobsList(@QueryParam(\"status\") statuses: JList[JobExecutionStatus]): Seq[JobData]","text":"

                                                                                                                                                                              jobsList...FIXME

                                                                                                                                                                              NOTE: jobsList is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/ApiRequestContext/","title":"ApiRequestContext","text":"

                                                                                                                                                                              == [[ApiRequestContext]] ApiRequestContext

                                                                                                                                                                              ApiRequestContext is the <> of...FIXME

                                                                                                                                                                              [[contract]] [source, scala]

                                                                                                                                                                              package org.apache.spark.status.api.v1

trait ApiRequestContext {
  // only required methods that have no implementation
  // the others follow
  @Context
  var servletContext: ServletContext = _

  @Context
  var httpRequest: HttpServletRequest = _
}

                                                                                                                                                                              NOTE: ApiRequestContext is a private[v1] contract.

                                                                                                                                                                              .ApiRequestContext Contract [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                              | httpRequest | [[httpRequest]] Java Servlets' HttpServletRequest

                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                              | servletContext | [[servletContext]] Java Servlets' ServletContext

                                                                                                                                                                              Used when...FIXME |===

                                                                                                                                                                              [[implementations]] .ApiRequestContexts [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | ApiRequestContext | Description

                                                                                                                                                                              | spark-api-ApiRootResource.md[ApiRootResource] | [[ApiRootResource]]

                                                                                                                                                                              | ApiStreamingApp | [[ApiStreamingApp]]

                                                                                                                                                                              | spark-api-ApplicationListResource.md[ApplicationListResource] | [[ApplicationListResource]]

                                                                                                                                                                              | spark-api-BaseAppResource.md[BaseAppResource] | [[BaseAppResource]]

                                                                                                                                                                              | SecurityFilter | [[SecurityFilter]] |===

                                                                                                                                                                              === [[uiRoot]] Getting Current UIRoot -- uiRoot Method

                                                                                                                                                                              "},{"location":"rest/ApiRequestContext/#source-scala","title":"[source, scala]","text":""},{"location":"rest/ApiRequestContext/#uiroot-uiroot","title":"uiRoot: UIRoot","text":"

                                                                                                                                                                              uiRoot simply requests UIRootFromServletContext to spark-api-UIRootFromServletContext.md#getUiRoot[get the current UIRoot] (for the given <>).
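Given the description above, the implementation is likely a one-liner along these lines (a sketch, not copied from the Spark sources):

[source, scala]
----
// look up the UIRoot that was stored in the servlet context
// (see UIRootFromServletContext); servletContext is the injected @Context field
def uiRoot: UIRoot = UIRootFromServletContext.getUiRoot(servletContext)
----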

                                                                                                                                                                              NOTE: uiRoot is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/ApiRootResource/","title":"ApiRootResource","text":"

                                                                                                                                                                              == [[ApiRootResource]] ApiRootResource -- /api/v1 URI Handler

                                                                                                                                                                              ApiRootResource is the spark-api-ApiRequestContext.md[ApiRequestContext] for the /v1 URI path.

                                                                                                                                                                              ApiRootResource uses @Path(\"/v1\") annotation at the class level. It is a partial URI path template relative to the base URI of the server on which the resource is deployed, the context root of the application, and the URL pattern to which the JAX-RS runtime responds.

                                                                                                                                                                              TIP: Learn more about @Path annotation in https://docs.oracle.com/cd/E19798-01/821-1841/6nmq2cp26/index.html[The @Path Annotation and URI Path Templates].
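For illustration only, a minimal JAX-RS resource sketch (PingResource and the ping path are made up and not part of Spark) showing how a class-level @Path template combines with method-level paths: once mounted under the /api/* context handler described below, GET /api/v1/ping would reach the ping() method.

[source, scala]
----
import javax.ws.rs.{GET, Path, Produces}
import javax.ws.rs.core.MediaType

// Class-level @Path("/v1") is a partial URI path template; the final URI also
// depends on where the servlet handler is mounted (e.g. /api/*).
@Path("/v1")
private[v1] class PingResource {

  // Method-level @Path is appended to the class-level template => /v1/ping
  @GET
  @Path("ping")
  @Produces(Array(MediaType.APPLICATION_JSON))
  def ping(): String = """{"status": "ok"}"""
}
----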

                                                                                                                                                                              ApiRootResource <> the /api/* context handler (with the REST resources and providers in org.apache.spark.status.api.v1 package).

                                                                                                                                                                              With the @Path(\"/v1\") annotation and after <> the /api/* context handler, ApiRootResource serves HTTP requests for <> under the /api/v1 URI paths for spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer].

                                                                                                                                                                              ApiRootResource gives the metrics of a Spark application in JSON format (using JAX-RS API).

                                                                                                                                                                              // start spark-shell\n$ http http://localhost:4040/api/v1/applications\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 257\nContent-Type: application/json\nDate: Tue, 05 Jun 2018 18:36:16 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[\n    {\n        \"attempts\": [\n            {\n                \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n                \"completed\": false,\n                \"duration\": 0,\n                \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n                \"endTimeEpoch\": -1,\n                \"lastUpdated\": \"2018-06-05T15:04:48.328GMT\",\n                \"lastUpdatedEpoch\": 1528211088328,\n                \"sparkUser\": \"jacek\",\n                \"startTime\": \"2018-06-05T15:04:48.328GMT\",\n                \"startTimeEpoch\": 1528211088328\n            }\n        ],\n        \"id\": \"local-1528211089216\",\n        \"name\": \"Spark shell\"\n    }\n]\n\n// Fixed in Spark 2.3.1\n// https://issues.apache.org/jira/browse/SPARK-24188\n$ http http://localhost:4040/api/v1/version\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 43\nContent-Type: application/json\nDate: Thu, 14 Jun 2018 08:19:06 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n{\n    \"spark\": \"2.3.1\"\n}\n

                                                                                                                                                                              [[paths]] .ApiRootResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

| [[applications]] applications | | [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

| [[applications_appId]] applications/\\{appId} | | [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

                                                                                                                                                                              | [[version]] version | GET | Creates a VersionInfo with the current version of Spark |===

                                                                                                                                                                              === [[getServletHandler]] Creating /api/* Context Handler -- getServletHandler Method

                                                                                                                                                                              "},{"location":"rest/ApiRootResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/ApiRootResource/#getservlethandleruiroot-uiroot-servletcontexthandler","title":"getServletHandler(uiRoot: UIRoot): ServletContextHandler","text":"

                                                                                                                                                                              getServletHandler creates a Jetty ServletContextHandler for /api context path.

NOTE: The Jetty ServletContextHandler created does not support HTTP sessions as the REST API is stateless.

                                                                                                                                                                              getServletHandler creates a Jetty ServletHolder with the resources and providers in org.apache.spark.status.api.v1 package. It then registers the ServletHolder to serve /* context path (under the ServletContextHandler for /api).

                                                                                                                                                                              getServletHandler requests UIRootFromServletContext to spark-api-UIRootFromServletContext.md#setUiRoot[setUiRoot] with the ServletContextHandler and the input spark-api-UIRoot.md[UIRoot].
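Put together, the steps above amount to roughly the following (a sketch that assumes Jersey's ServletContainer as the JAX-RS runtime; details such as the exact init parameters may differ from the actual implementation):

[source, scala]
----
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}
import org.glassfish.jersey.server.ServerProperties
import org.glassfish.jersey.servlet.ServletContainer

def getServletHandler(uiRoot: UIRoot): ServletContextHandler = {
  // no HTTP sessions -- the REST API is stateless
  val jerseyContext = new ServletContextHandler(ServletContextHandler.NO_SESSIONS)
  jerseyContext.setContextPath("/api")

  // Jersey servlet scanning the REST resources and providers
  val holder = new ServletHolder(classOf[ServletContainer])
  holder.setInitParameter(ServerProperties.PROVIDER_PACKAGES, "org.apache.spark.status.api.v1")

  // make the UIRoot available to the resources via the servlet context
  UIRootFromServletContext.setUiRoot(jerseyContext, uiRoot)

  jerseyContext.addServlet(holder, "/*")
  jerseyContext
}
----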

                                                                                                                                                                              NOTE: getServletHandler is used when spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer] are requested to initialize.

                                                                                                                                                                              "},{"location":"rest/ApplicationListResource/","title":"ApplicationListResource","text":"

                                                                                                                                                                              == [[ApplicationListResource]] ApplicationListResource -- applications URI Handler

                                                                                                                                                                              ApplicationListResource is a spark-api-ApiRequestContext.md[ApiRequestContext] that spark-api-ApiRootResource.md#applications[ApiRootResource] uses to handle <> URI path.

                                                                                                                                                                              [[paths]] .ApplicationListResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

                                                                                                                                                                              | [[root]] / | GET | <> |===

                                                                                                                                                                              // start spark-shell\n// there should be a single Spark application -- the spark-shell itself\n$ http http://localhost:4040/api/v1/applications\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 255\nContent-Type: application/json\nDate: Wed, 06 Jun 2018 12:40:33 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[\n    {\n        \"attempts\": [\n            {\n                \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n                \"completed\": false,\n                \"duration\": 0,\n                \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n                \"endTimeEpoch\": -1,\n                \"lastUpdated\": \"2018-06-06T12:30:19.220GMT\",\n                \"lastUpdatedEpoch\": 1528288219220,\n                \"sparkUser\": \"jacek\",\n                \"startTime\": \"2018-06-06T12:30:19.220GMT\",\n                \"startTimeEpoch\": 1528288219220\n            }\n        ],\n        \"id\": \"local-1528288219790\",\n        \"name\": \"Spark shell\"\n    }\n]\n

                                                                                                                                                                              === [[isAttemptInRange]] isAttemptInRange Internal Method

                                                                                                                                                                              "},{"location":"rest/ApplicationListResource/#source-scala","title":"[source, scala]","text":"

isAttemptInRange(
  attempt: ApplicationAttemptInfo,
  minStartDate: SimpleDateParam,
  maxStartDate: SimpleDateParam,
  minEndDate: SimpleDateParam,
  maxEndDate: SimpleDateParam,
  anyRunning: Boolean): Boolean

                                                                                                                                                                              isAttemptInRange...FIXME

                                                                                                                                                                              NOTE: isAttemptInRange is used exclusively when ApplicationListResource is requested to handle a <> HTTP request.

                                                                                                                                                                              === [[appList]] appList Method

                                                                                                                                                                              "},{"location":"rest/ApplicationListResource/#source-scala_1","title":"[source, scala]","text":"

appList(
  @QueryParam(\"status\") status: JList[ApplicationStatus],
  @DefaultValue(\"2010-01-01\") @QueryParam(\"minDate\") minDate: SimpleDateParam,
  @DefaultValue(\"3000-01-01\") @QueryParam(\"maxDate\") maxDate: SimpleDateParam,
  @DefaultValue(\"2010-01-01\") @QueryParam(\"minEndDate\") minEndDate: SimpleDateParam,
  @DefaultValue(\"3000-01-01\") @QueryParam(\"maxEndDate\") maxEndDate: SimpleDateParam,
  @QueryParam(\"limit\") limit: Integer): Iterator[ApplicationInfo]

                                                                                                                                                                              appList...FIXME

                                                                                                                                                                              NOTE: appList is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/BaseAppResource/","title":"BaseAppResource","text":"

                                                                                                                                                                              == [[BaseAppResource]] BaseAppResource

                                                                                                                                                                              BaseAppResource is the contract of spark-api-ApiRequestContext.md[ApiRequestContexts] that can <> and use <> and <> path parameters in URI paths.

                                                                                                                                                                              [[path-params]] .BaseAppResource's Path Parameters [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Name | Description

                                                                                                                                                                              | appId | [[appId]] @PathParam(\"appId\")

                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                              | attemptId | [[attemptId]] @PathParam(\"attemptId\")

                                                                                                                                                                              Used when...FIXME |===

                                                                                                                                                                              [[implementations]] .BaseAppResources [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | BaseAppResource | Description

                                                                                                                                                                              | spark-api-AbstractApplicationResource.md[AbstractApplicationResource] | [[AbstractApplicationResource]]

                                                                                                                                                                              | BaseStreamingAppResource | [[BaseStreamingAppResource]]

                                                                                                                                                                              | spark-api-StagesResource.md[StagesResource] | [[StagesResource]] |===

                                                                                                                                                                              NOTE: BaseAppResource is a private[v1] contract.

                                                                                                                                                                              === [[withUI]] withUI Method

                                                                                                                                                                              "},{"location":"rest/BaseAppResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/BaseAppResource/#withuit-t","title":"withUIT: T","text":"

                                                                                                                                                                              withUI...FIXME

                                                                                                                                                                              NOTE: withUI is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/OneApplicationAttemptResource/","title":"OneApplicationAttemptResource","text":"

                                                                                                                                                                              == [[OneApplicationAttemptResource]] OneApplicationAttemptResource

                                                                                                                                                                              OneApplicationAttemptResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly).

                                                                                                                                                                              OneApplicationAttemptResource is used when AbstractApplicationResource is requested to spark-api-AbstractApplicationResource.md#applicationAttempt[applicationAttempt].

                                                                                                                                                                              [[paths]] .OneApplicationAttemptResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

                                                                                                                                                                              | [[root]] / | GET | <> |===

                                                                                                                                                                              // start spark-shell\n// there should be a single Spark application -- the spark-shell itself\n// CAUTION: FIXME Demo of OneApplicationAttemptResource in Action\n

                                                                                                                                                                              === [[getAttempt]] getAttempt Method

                                                                                                                                                                              "},{"location":"rest/OneApplicationAttemptResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/OneApplicationAttemptResource/#getattempt-applicationattemptinfo","title":"getAttempt(): ApplicationAttemptInfo","text":"

                                                                                                                                                                              getAttempt requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]) and finds the spark-api-BaseAppResource.md#attemptId[attemptId] among the available attempts.

                                                                                                                                                                              NOTE: spark-api-BaseAppResource.md#appId[appId] and spark-api-BaseAppResource.md#attemptId[attemptId] are path parameters.

                                                                                                                                                                              In the end, getAttempt returns the ApplicationAttemptInfo if available or reports a NotFoundException:

                                                                                                                                                                              unknown app [appId], attempt [attemptId]\n
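A sketch of the behaviour just described (assuming the application attempts carry an optional attemptId field; not the verbatim implementation):

[source, scala]
----
@GET
def getAttempt(): ApplicationAttemptInfo = {
  uiRoot.getApplicationInfo(appId)
    .flatMap(_.attempts.find(_.attemptId.contains(attemptId)))
    .getOrElse(throw new NotFoundException(s"unknown app $appId, attempt $attemptId"))
}
----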
                                                                                                                                                                              "},{"location":"rest/OneApplicationResource/","title":"OneApplicationResource","text":"

                                                                                                                                                                              == [[OneApplicationResource]] OneApplicationResource -- applications/appId URI Handler

                                                                                                                                                                              OneApplicationResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly) that spark-api-ApiRootResource.md#applications_appId[ApiRootResource] uses to handle <> URI path.

                                                                                                                                                                              [[paths]] .OneApplicationResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

                                                                                                                                                                              | [[root]] / | GET | <> |===

                                                                                                                                                                              // start spark-shell\n// there should be a single Spark application -- the spark-shell itself\n$ http http://localhost:4040/api/v1/applications\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 255\nContent-Type: application/json\nDate: Wed, 06 Jun 2018 12:40:33 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[\n    {\n        \"attempts\": [\n            {\n                \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n                \"completed\": false,\n                \"duration\": 0,\n                \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n                \"endTimeEpoch\": -1,\n                \"lastUpdated\": \"2018-06-06T12:30:19.220GMT\",\n                \"lastUpdatedEpoch\": 1528288219220,\n                \"sparkUser\": \"jacek\",\n                \"startTime\": \"2018-06-06T12:30:19.220GMT\",\n                \"startTimeEpoch\": 1528288219220\n            }\n        ],\n        \"id\": \"local-1528288219790\",\n        \"name\": \"Spark shell\"\n    }\n]\n\n$ http http://localhost:4040/api/v1/applications/local-1528288219790\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 255\nContent-Type: application/json\nDate: Wed, 06 Jun 2018 12:41:43 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n{\n    \"attempts\": [\n        {\n            \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n            \"completed\": false,\n            \"duration\": 0,\n            \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n            \"endTimeEpoch\": -1,\n            \"lastUpdated\": \"2018-06-06T12:30:19.220GMT\",\n            \"lastUpdatedEpoch\": 1528288219220,\n            \"sparkUser\": \"jacek\",\n            \"startTime\": \"2018-06-06T12:30:19.220GMT\",\n            \"startTimeEpoch\": 1528288219220\n        }\n    ],\n    \"id\": \"local-1528288219790\",\n    \"name\": \"Spark shell\"\n}\n

                                                                                                                                                                              === [[getApp]] getApp Method

                                                                                                                                                                              "},{"location":"rest/OneApplicationResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/OneApplicationResource/#getapp-applicationinfo","title":"getApp(): ApplicationInfo","text":"

                                                                                                                                                                              getApp requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]).

                                                                                                                                                                              In the end, getApp returns the ApplicationInfo if available or reports a NotFoundException:

                                                                                                                                                                              unknown app: [appId]\n
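A sketch of the behaviour just described (using the uiRoot and appId introduced above; not the verbatim implementation):

[source, scala]
----
@GET
def getApp(): ApplicationInfo = {
  uiRoot.getApplicationInfo(appId)
    .getOrElse(throw new NotFoundException("unknown app: " + appId))
}
----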
                                                                                                                                                                              "},{"location":"rest/StagesResource/","title":"StagesResource","text":"

                                                                                                                                                                              == [[StagesResource]] StagesResource

                                                                                                                                                                              StagesResource is...FIXME

                                                                                                                                                                              [[paths]] .StagesResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

                                                                                                                                                                              | | GET | <>

                                                                                                                                                                              | {stageId: \\d+} | GET | <>

                                                                                                                                                                              | {stageId: \\d+}/{stageAttemptId: \\d+} | GET | <>

                                                                                                                                                                              | {stageId: \\d+}/{stageAttemptId: \\d+}/taskSummary | GET | <>

                                                                                                                                                                              | {stageId: \\d+}/{stageAttemptId: \\d+}/taskList | GET | <> |===

                                                                                                                                                                              === [[stageList]] stageList Method

                                                                                                                                                                              "},{"location":"rest/StagesResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/StagesResource/#stagelistqueryparamstatus-statuses-jliststagestatus-seqstagedata","title":"stageList(@QueryParam(\"status\") statuses: JList[StageStatus]): Seq[StageData]","text":"

                                                                                                                                                                              stageList...FIXME

                                                                                                                                                                              NOTE: stageList is used when...FIXME

                                                                                                                                                                              === [[stageData]] stageData Method

                                                                                                                                                                              "},{"location":"rest/StagesResource/#source-scala_1","title":"[source, scala]","text":"

stageData(
  @PathParam(\"stageId\") stageId: Int,
  @QueryParam(\"details\") @DefaultValue(\"true\") details: Boolean): Seq[StageData]

                                                                                                                                                                              stageData...FIXME

                                                                                                                                                                              NOTE: stageData is used when...FIXME

                                                                                                                                                                              === [[oneAttemptData]] oneAttemptData Method

                                                                                                                                                                              "},{"location":"rest/StagesResource/#source-scala_2","title":"[source, scala]","text":"

oneAttemptData(
  @PathParam(\"stageId\") stageId: Int,
  @PathParam(\"stageAttemptId\") stageAttemptId: Int,
  @QueryParam(\"details\") @DefaultValue(\"true\") details: Boolean): StageData

                                                                                                                                                                              oneAttemptData...FIXME

                                                                                                                                                                              NOTE: oneAttemptData is used when...FIXME

                                                                                                                                                                              === [[taskSummary]] taskSummary Method

                                                                                                                                                                              "},{"location":"rest/StagesResource/#source-scala_3","title":"[source, scala]","text":"

taskSummary(
  @PathParam(\"stageId\") stageId: Int,
  @PathParam(\"stageAttemptId\") stageAttemptId: Int,
  @DefaultValue(\"0.05,0.25,0.5,0.75,0.95\") @QueryParam(\"quantiles\") quantileString: String): TaskMetricDistributions

                                                                                                                                                                              taskSummary...FIXME

                                                                                                                                                                              NOTE: taskSummary is used when...FIXME
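The quantiles query parameter arrives as a comma-separated string (0.05,0.25,0.5,0.75,0.95 by default). A hypothetical helper (parseQuantiles is a made-up name, not from the Spark sources) showing how such a parameter could be turned into numbers:

[source, scala]
----
// Turn "0.05,0.25,0.5,0.75,0.95" into IndexedSeq(0.05, 0.25, 0.5, 0.75, 0.95),
// rejecting anything that is not a valid double.
def parseQuantiles(quantileString: String): IndexedSeq[Double] =
  quantileString.split(",").map { s =>
    try s.trim.toDouble
    catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"Invalid quantile: $s")
    }
  }.toIndexedSeq
----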

                                                                                                                                                                              === [[taskList]] taskList Method

                                                                                                                                                                              "},{"location":"rest/StagesResource/#source-scala_4","title":"[source, scala]","text":"

taskList(
  @PathParam(\"stageId\") stageId: Int,
  @PathParam(\"stageAttemptId\") stageAttemptId: Int,
  @DefaultValue(\"0\") @QueryParam(\"offset\") offset: Int,
  @DefaultValue(\"20\") @QueryParam(\"length\") length: Int,
  @DefaultValue(\"ID\") @QueryParam(\"sortBy\") sortBy: TaskSorting): Seq[TaskData]

                                                                                                                                                                              taskList...FIXME

                                                                                                                                                                              NOTE: taskList is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/UIRoot/","title":"UIRoot","text":"

== [[UIRoot]] UIRoot -- Contract for Root Containers of Application UI Information

                                                                                                                                                                              UIRoot is the <> of the <>.

                                                                                                                                                                              [[contract]] [source, scala]

                                                                                                                                                                              package org.apache.spark.status.api.v1

trait UIRoot {
  // only required methods that have no implementation
  // the others follow
  def withSparkUI[T](fn: SparkUI => T): T
  def getApplicationInfoList: Iterator[ApplicationInfo]
  def getApplicationInfo(appId: String): Option[ApplicationInfo]
  def securityManager: SecurityManager
}

                                                                                                                                                                              NOTE: UIRoot is a private[spark] contract.

                                                                                                                                                                              .UIRoot Contract [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                              | getApplicationInfo | [[getApplicationInfo]] Used when...FIXME

                                                                                                                                                                              | getApplicationInfoList | [[getApplicationInfoList]] Used when...FIXME

                                                                                                                                                                              | securityManager | [[securityManager]] Used when...FIXME

| withSparkUI | [[withSparkUI]] Used exclusively when BaseAppResource is requested to spark-api-BaseAppResource.md#withUI[withUI] |===

                                                                                                                                                                              [[implementations]] .UIRoots [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | UIRoot | Description

                                                                                                                                                                              | spark-history-server:HistoryServer.md[HistoryServer] | [[HistoryServer]] Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)

                                                                                                                                                                              | spark-webui-SparkUI.md[SparkUI] | [[SparkUI]] Application UI for an active Spark application (i.e. a Spark application that is still running) |===

                                                                                                                                                                              === [[writeEventLogs]] writeEventLogs Method

                                                                                                                                                                              "},{"location":"rest/UIRoot/#source-scala","title":"[source, scala]","text":""},{"location":"rest/UIRoot/#writeeventlogsappid-string-attemptid-optionstring-zipstream-zipoutputstream-unit","title":"writeEventLogs(appId: String, attemptId: Option[String], zipStream: ZipOutputStream): Unit","text":"

                                                                                                                                                                              writeEventLogs...FIXME

                                                                                                                                                                              NOTE: writeEventLogs is used when...FIXME

                                                                                                                                                                              "},{"location":"rest/UIRootFromServletContext/","title":"UIRootFromServletContext","text":"

                                                                                                                                                                              == [[UIRootFromServletContext]] UIRootFromServletContext

                                                                                                                                                                              UIRootFromServletContext manages the current <> object in a Jetty ContextHandler.

                                                                                                                                                                              [[attribute]] UIRootFromServletContext uses its canonical name for the context attribute that is used to <> or <> the current spark-api-UIRoot.md[UIRoot] object (in Jetty's ContextHandler).

                                                                                                                                                                              NOTE: https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/server/handler/ContextHandler.html[ContextHandler] is the environment for multiple Jetty Handlers, e.g. URI context path, class loader, static resource base.

                                                                                                                                                                              In essence, UIRootFromServletContext is simply a \"bridge\" between two worlds, Spark's spark-api-UIRoot.md[UIRoot] and Jetty's ContextHandler.
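The bridge boils down to a context attribute keyed by the class' canonical name, roughly as follows (a sketch of the idea, not the verbatim sources):

[source, scala]
----
import javax.servlet.ServletContext
import org.eclipse.jetty.server.handler.ContextHandler

private[spark] object UIRootFromServletContext {

  // the context attribute key -- the canonical class name
  private val attribute = getClass.getCanonicalName

  def setUiRoot(contextHandler: ContextHandler, uiRoot: UIRoot): Unit =
    contextHandler.setAttribute(attribute, uiRoot)

  def getUiRoot(context: ServletContext): UIRoot =
    context.getAttribute(attribute).asInstanceOf[UIRoot]
}
----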

                                                                                                                                                                              === [[setUiRoot]] setUiRoot Method

                                                                                                                                                                              "},{"location":"rest/UIRootFromServletContext/#source-scala","title":"[source, scala]","text":""},{"location":"rest/UIRootFromServletContext/#setuirootcontexthandler-contexthandler-uiroot-uiroot-unit","title":"setUiRoot(contextHandler: ContextHandler, uiRoot: UIRoot): Unit","text":"

                                                                                                                                                                              setUiRoot...FIXME

                                                                                                                                                                              NOTE: setUiRoot is used exclusively when ApiRootResource is requested to spark-api-ApiRootResource.md#getServletHandler[register /api/* context handler].

                                                                                                                                                                              === [[getUiRoot]] getUiRoot Method

                                                                                                                                                                              "},{"location":"rest/UIRootFromServletContext/#source-scala_1","title":"[source, scala]","text":""},{"location":"rest/UIRootFromServletContext/#getuirootcontext-servletcontext-uiroot","title":"getUiRoot(context: ServletContext): UIRoot","text":"

                                                                                                                                                                              getUiRoot...FIXME

                                                                                                                                                                              NOTE: getUiRoot is used exclusively when ApiRequestContext is requested for the current spark-api-ApiRequestContext.md#uiRoot[UIRoot].

                                                                                                                                                                              "},{"location":"rpc/","title":"RPC System","text":"

                                                                                                                                                                              RPC System is a communication system of Spark services.

                                                                                                                                                                              The main abstractions are RpcEnv and RpcEndpoint.
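As a quick illustration (a sketch that assumes access to Spark's private[spark] RPC API; EchoEndpoint is a made-up name), an endpoint registers itself with an RpcEnv and replies to messages sent through an RpcEndpointRef:

```scala
import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

// A made-up endpoint that replies to every String message it receives.
class EchoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(s"echo: $msg")
  }
}

val conf = new SparkConf()
// port 0 means "pick any free port"
val rpcEnv = RpcEnv.create("demo", "localhost", 0, conf, new SecurityManager(conf))
val ref = rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv))
assert(ref.askSync[String]("hello") == "echo: hello")
rpcEnv.shutdown()
```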

                                                                                                                                                                              "},{"location":"rpc/NettyRpcEnv/","title":"NettyRpcEnv","text":"

                                                                                                                                                                              NettyRpcEnv is an RpcEnv that uses Netty (\"an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients\").

                                                                                                                                                                              "},{"location":"rpc/NettyRpcEnv/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                              NettyRpcEnv takes the following to be created:

                                                                                                                                                                              • SparkConf
                                                                                                                                                                              • JavaSerializerInstance
                                                                                                                                                                              • Host Name
                                                                                                                                                                              • SecurityManager
                                                                                                                                                                              • Number of CPU Cores

                                                                                                                                                                                NettyRpcEnv is created\u00a0when:

                                                                                                                                                                                • NettyRpcEnvFactory is requested to create an RpcEnv
                                                                                                                                                                                "},{"location":"rpc/NettyRpcEnvFactory/","title":"NettyRpcEnvFactory","text":"

                                                                                                                                                                                NettyRpcEnvFactory is an RpcEnvFactory for a Netty-based RpcEnv.

                                                                                                                                                                                "},{"location":"rpc/NettyRpcEnvFactory/#creating-rpcenv","title":"Creating RpcEnv
                                                                                                                                                                                create(\n  config: RpcEnvConfig): RpcEnv\n

                                                                                                                                                                                create creates a JavaSerializerInstance (using a JavaSerializer).

                                                                                                                                                                                Note

                                                                                                                                                                                KryoSerializer is not supported.

create creates a NettyRpcEnv with the JavaSerializerInstance. create uses the given RpcEnvConfig for the advertised address, SecurityManager and number of CPU cores.

create returns the NettyRpcEnv right away when the clientMode flag is turned on; otherwise (server mode) create continues as follows.

In server mode, create attempts to start the NettyRpcEnv on a given port. create uses the given RpcEnvConfig for the port, bind address, and name. With the port, the NettyRpcEnv is requested to start a server.

create is part of the RpcEnvFactory abstraction.
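Condensed, the flow looks roughly as follows (an illustrative sketch that mirrors the steps above; all of these types are private[spark] and the real implementation adds error handling):

import org.apache.spark.rpc.{RpcEnv, RpcEnvConfig}
import org.apache.spark.rpc.netty.NettyRpcEnv
import org.apache.spark.serializer.{JavaSerializer, JavaSerializerInstance}
import org.apache.spark.util.Utils

def createSketch(config: RpcEnvConfig): RpcEnv = {
  val sparkConf = config.conf
  // Only Java serialization is supported for RPC messages.
  val javaSerializerInstance =
    new JavaSerializer(sparkConf).newInstance().asInstanceOf[JavaSerializerInstance]
  val nettyEnv = new NettyRpcEnv(sparkConf, javaSerializerInstance,
    config.advertiseAddress, config.securityManager, config.numUsableCores)
  if (!config.clientMode) {
    // Server mode: start listening on the configured bind address and port.
    val startNettyRpcEnv: Int => (NettyRpcEnv, Int) = { actualPort =>
      nettyEnv.startServer(config.bindAddress, actualPort)
      (nettyEnv, nettyEnv.address.port)
    }
    Utils.startServiceOnPort(config.port, startNettyRpcEnv, sparkConf, config.name)._1
  } else {
    nettyEnv
  }
}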

                                                                                                                                                                                ","text":""},{"location":"rpc/RpcAddress/","title":"RpcAddress","text":"

                                                                                                                                                                                RpcAddress is a logical address of an RPC system, with hostname and port.

                                                                                                                                                                                RpcAddress can be encoded as a Spark URL in the format of spark://host:port.

                                                                                                                                                                                "},{"location":"rpc/RpcAddress/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                RpcAddress takes the following to be created:

                                                                                                                                                                                • Host
                                                                                                                                                                                • Port"},{"location":"rpc/RpcAddress/#creating-rpcaddress-based-on-spark-url","title":"Creating RpcAddress based on Spark URL
                                                                                                                                                                                  fromSparkURL(\n  sparkUrl: String): RpcAddress\n

fromSparkURL extracts the host and port from the input Spark URL and creates an RpcAddress (a plain-Scala sketch of this parsing follows the list below).

                                                                                                                                                                                  fromSparkURL\u00a0is used when:

                                                                                                                                                                                  • StandaloneAppClient (Spark Standalone) is created
                                                                                                                                                                                  • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                  • Worker (Spark Standalone) is requested to startRpcEnvAndEndpoint
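The parsing can be sketched in plain Scala with java.net.URI (illustrative only; Spark's RpcAddress is a private[spark] class with its own validation):

import java.net.URI

// Minimal sketch of fromSparkURL-style parsing.
final case class SimpleRpcAddress(host: String, port: Int) {
  def toSparkURL: String = s"spark://$host:$port"
}

def fromSparkURL(sparkUrl: String): SimpleRpcAddress = {
  val uri = new URI(sparkUrl)
  require(uri.getScheme == "spark" && uri.getHost != null && uri.getPort != -1,
    s"Invalid Spark URL: $sparkUrl")
  SimpleRpcAddress(uri.getHost, uri.getPort)
}

// fromSparkURL("spark://localhost:7077") == SimpleRpcAddress("localhost", 7077)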
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/","title":"RpcEndpoint","text":"

                                                                                                                                                                                  RpcEndpoint is an abstraction of RPC endpoints that are registered to an RpcEnv to process one- (fire-and-forget) or two-way messages.

                                                                                                                                                                                  "},{"location":"rpc/RpcEndpoint/#contract","title":"Contract","text":""},{"location":"rpc/RpcEndpoint/#onconnected","title":"onConnected
                                                                                                                                                                                  onConnected(\n  remoteAddress: RpcAddress): Unit\n

                                                                                                                                                                                  Invoked when RpcAddress is connected to the current node

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process a RemoteProcessConnected message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#ondisconnected","title":"onDisconnected
                                                                                                                                                                                  onDisconnected(\n  remoteAddress: RpcAddress): Unit\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process a RemoteProcessDisconnected message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#onerror","title":"onError
                                                                                                                                                                                  onError(\n  cause: Throwable): Unit\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process a message that threw a NonFatal exception
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#onnetworkerror","title":"onNetworkError
                                                                                                                                                                                  onNetworkError(\n  cause: Throwable,\n  remoteAddress: RpcAddress): Unit\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process a RemoteProcessConnectionError message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#onstart","title":"onStart
                                                                                                                                                                                  onStart(): Unit\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process an OnStart message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#onstop","title":"onStop
                                                                                                                                                                                  onStop(): Unit\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process an OnStop message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#processing-one-way-messages","title":"Processing One-Way Messages
                                                                                                                                                                                  receive: PartialFunction[Any, Unit]\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process an OneWayMessage message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#processing-two-way-messages","title":"Processing Two-Way Messages
                                                                                                                                                                                  receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                                                                                                                                                                  Used when:

                                                                                                                                                                                  • Inbox is requested to process a RpcMessage message
                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#rpcenv","title":"RpcEnv
                                                                                                                                                                                  rpcEnv: RpcEnv\n

                                                                                                                                                                                  RpcEnv this RpcEndpoint is registered to

                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEndpoint/#implementations","title":"Implementations","text":"
                                                                                                                                                                                  • AMEndpoint
                                                                                                                                                                                  • IsolatedRpcEndpoint
                                                                                                                                                                                  • MapOutputTrackerMasterEndpoint
                                                                                                                                                                                  • OutputCommitCoordinatorEndpoint
                                                                                                                                                                                  • RpcEndpointVerifier
                                                                                                                                                                                  • ThreadSafeRpcEndpoint
                                                                                                                                                                                  • WorkerWatcher
                                                                                                                                                                                  • "},{"location":"rpc/RpcEndpoint/#self","title":"self
                                                                                                                                                                                    self: RpcEndpointRef\n

                                                                                                                                                                                    self requests the RpcEnv for the RpcEndpointRef of this RpcEndpoint.

                                                                                                                                                                                    self throws an IllegalArgumentException when the RpcEnv has not been initialized:

                                                                                                                                                                                    rpcEnv has not been initialized\n
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEndpoint/#stopping-rpcendpoint","title":"Stopping RpcEndpoint
                                                                                                                                                                                    stop(): Unit\n

                                                                                                                                                                                    stop requests the RpcEnv to stop this RpcEndpoint
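Putting the contract together, a custom endpoint overrides receive for one-way messages and receiveAndReply for two-way messages. A minimal sketch (illustrative only: the org.apache.spark.rpc API is private[spark], so such code only compiles inside Spark's own packages, and the class name and messages are hypothetical):

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

class EchoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {

  // One-way (fire-and-forget) messages, dispatched by the Inbox as OneWayMessage.
  override def receive: PartialFunction[Any, Unit] = {
    case msg: String => println(s"received: $msg")
  }

  // Two-way messages (RpcMessage); the RpcCallContext carries the reply channel.
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(s"echo: $msg")
  }

  override def onStart(): Unit = println("EchoEndpoint started")
  override def onStop(): Unit = println("EchoEndpoint stopped")
}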

                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEndpointAddress/","title":"RpcEndpointAddress","text":"

RpcEndpointAddress is a logical address of an endpoint in an RPC system, with an RpcAddress and a name.

                                                                                                                                                                                    RpcEndpointAddress is in the format of spark://[name]@[rpcAddress.host]:[rpcAddress.port].
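For illustration, the format can be reproduced in plain Scala (a sketch; Spark's own RpcEndpointAddress is a private[spark] class):

// Plain-Scala sketch of the RpcEndpointAddress string format.
final case class SimpleRpcEndpointAddress(host: String, port: Int, name: String) {
  override def toString: String = s"spark://$name@$host:$port"
}

// SimpleRpcEndpointAddress("192.168.1.1", 7077, "CoarseGrainedScheduler").toString
// => spark://CoarseGrainedScheduler@192.168.1.1:7077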

                                                                                                                                                                                    "},{"location":"rpc/RpcEndpointRef/","title":"RpcEndpointRef","text":"

RpcEndpointRef is a reference to an RpcEndpoint in an RpcEnv.

RpcEndpointRef is serializable, so you can send it over the network or save it for later use (it can, however, be deserialized only by the owning RpcEnv).

An RpcEndpointRef has an address (a Spark URL) and a name.

You can send asynchronous one-way messages to the corresponding RpcEndpoint using the send method.

You can send a semi-synchronous message, i.e. \"subscribe\" to be notified when a response arrives, using the ask method. You can also block the current calling thread for a response using the askWithRetry method.

• spark.rpc.numRetries (default: 3) - the number of times to retry connection attempts.
• spark.rpc.retry.wait (default: 3s) - the time to wait on each retry.

It also uses lookup timeouts.
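A typical interaction with an RpcEndpointRef can be sketched as follows (illustrative only: RpcEndpointRef is private[spark], and the message classes here are hypothetical stand-ins):

import scala.concurrent.Future
import org.apache.spark.rpc.RpcEndpointRef

// Hypothetical message types used only for this sketch.
case class Heartbeat(executorId: String)
case class RegisterExecutor(executorId: String)
case class RegisteredExecutor(ok: Boolean)

def talkToDriver(driver: RpcEndpointRef): Unit = {
  // Fire-and-forget: delivered to the endpoint's receive handler.
  driver.send(Heartbeat("exec-1"))

  // Semi-synchronous: the Future completes once the endpoint replies from
  // receiveAndReply (uses the default ask timeout).
  val reply: Future[RegisteredExecutor] =
    driver.ask[RegisteredExecutor](RegisterExecutor("exec-1"))
}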

send Method

CAUTION: FIXME

askWithRetry Method

CAUTION: FIXME

                                                                                                                                                                                    "},{"location":"rpc/RpcEnv/","title":"RpcEnv","text":"

                                                                                                                                                                                    RpcEnv is an abstraction of RPC environments.

                                                                                                                                                                                    "},{"location":"rpc/RpcEnv/#contract","title":"Contract","text":""},{"location":"rpc/RpcEnv/#address","title":"address
                                                                                                                                                                                    address: RpcAddress\n

RpcAddress of this RPC environment

                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#asyncsetupendpointrefbyuri","title":"asyncSetupEndpointRefByURI
                                                                                                                                                                                    asyncSetupEndpointRefByURI(\n  uri: String): Future[RpcEndpointRef]\n

                                                                                                                                                                                    Looking up a RpcEndpointRef of the RPC endpoint by URI (asynchronously)

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • WorkerWatcher is created
                                                                                                                                                                                    • CoarseGrainedExecutorBackend is requested to onStart
                                                                                                                                                                                    • RpcEnv is requested to setupEndpointRefByURI
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#awaittermination","title":"awaitTermination
                                                                                                                                                                                    awaitTermination(): Unit\n

                                                                                                                                                                                    Blocks the current thread till the RPC environment terminates

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • SparkEnv is requested to stop
                                                                                                                                                                                    • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                    • LocalSparkCluster (Spark Standalone) is requested to stop
                                                                                                                                                                                    • Master (Spark Standalone) and Worker (Spark Standalone) are launched
                                                                                                                                                                                    • CoarseGrainedExecutorBackend is requested to run
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#deserialize","title":"deserialize
                                                                                                                                                                                    deserialize[T](\n  deserializationAction: () => T): T\n

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • PersistenceEngine is requested to readPersistedData
                                                                                                                                                                                    • NettyRpcEnv is requested to deserialize
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#endpointref","title":"endpointRef
                                                                                                                                                                                    endpointRef(\n  endpoint: RpcEndpoint): RpcEndpointRef\n

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • RpcEndpoint is requested for the RpcEndpointRef to itself
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#rpcenvfileserver","title":"RpcEnvFileServer
                                                                                                                                                                                    fileServer: RpcEnvFileServer\n

                                                                                                                                                                                    RpcEnvFileServer of this RPC environment

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • SparkContext is requested to addFile, addJar and is created (and registers the REPL's output directory)
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#openchannel","title":"openChannel
                                                                                                                                                                                    openChannel(\n  uri: String): ReadableByteChannel\n

                                                                                                                                                                                    Opens a channel to download a file at the given URI

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • Utils utility is used to doFetchFile
                                                                                                                                                                                    • ExecutorClassLoader is requested to getClassFileInputStreamFromSparkRPC
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#setupendpoint","title":"setupEndpoint
                                                                                                                                                                                    setupEndpoint(\n  name: String,\n  endpoint: RpcEndpoint): RpcEndpointRef\n
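setupEndpoint registers the given RpcEndpoint under the given name and returns its RpcEndpointRef. Illustrative usage (private[spark] APIs), reusing the EchoEndpoint sketched on the RpcEndpoint page:

import org.apache.spark.rpc.RpcEnv

def registerEcho(rpcEnv: RpcEnv): Unit = {
  val echoRef = rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv))
  echoRef.send("hello")                       // one-way, handled by receive
  val answer = echoRef.askSync[String]("hi")  // two-way, handled by receiveAndReply
}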
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#shutdown","title":"shutdown
                                                                                                                                                                                    shutdown(): Unit\n

                                                                                                                                                                                    Shuts down this RPC environment asynchronously (and to make sure this RpcEnv exits successfully, use awaitTermination)

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • SparkEnv is requested to stop
                                                                                                                                                                                    • LocalSparkCluster (Spark Standalone) is requested to stop
                                                                                                                                                                                    • DriverWrapper is launched
                                                                                                                                                                                    • CoarseGrainedExecutorBackend is launched
                                                                                                                                                                                    • NettyRpcEnvFactory is requested to create an RpcEnv (in server mode and failed to assign a port)
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#stopping-rpcendpointref","title":"Stopping RpcEndpointRef
                                                                                                                                                                                    stop(\n  endpoint: RpcEndpointRef): Unit\n

                                                                                                                                                                                    Used when:

                                                                                                                                                                                    • SparkContext is requested to stop
                                                                                                                                                                                    • RpcEndpoint is requested to stop
                                                                                                                                                                                    • BlockManager is requested to stop
                                                                                                                                                                                    • in Spark SQL
                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcEnv/#implementations","title":"Implementations","text":"
                                                                                                                                                                                    • NettyRpcEnv
                                                                                                                                                                                    "},{"location":"rpc/RpcEnv/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                    RpcEnv takes the following to be created:

                                                                                                                                                                                    • SparkConf

                                                                                                                                                                                      RpcEnv is created using RpcEnv.create utility.

                                                                                                                                                                                      Abstract Class

RpcEnv is an abstract class and cannot be created directly. It is created indirectly as one of the concrete RpcEnvs.

                                                                                                                                                                                      "},{"location":"rpc/RpcEnv/#creating-rpcenv","title":"Creating RpcEnv
                                                                                                                                                                                      create(\n  name: String,\n  host: String,\n  port: Int,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  clientMode: Boolean = false): RpcEnv // (1)\ncreate(\n  name: String,\n  bindAddress: String,\n  advertiseAddress: String,\n  port: Int,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  numUsableCores: Int,\n  clientMode: Boolean): RpcEnv\n
                                                                                                                                                                                      1. Uses 0 for numUsableCores

                                                                                                                                                                                      create creates a NettyRpcEnvFactory and requests it to create an RpcEnv (with a new RpcEnvConfig with all the given arguments).
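For example, a driver-side RPC environment can be created along these lines (a sketch; RpcEnv.create and SecurityManager are private[spark], and the host and port values are arbitrary):

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnv

val conf = new SparkConf().setAppName("demo")
val securityManager = new SecurityManager(conf)

val rpcEnv = RpcEnv.create(
  name = "sparkDriver",
  host = "localhost",
  port = 7078,           // 0 would pick a random free port
  conf = conf,
  securityManager = securityManager,
  clientMode = false)    // server mode: the underlying NettyRpcEnv starts listening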

                                                                                                                                                                                      create is used when:

                                                                                                                                                                                      • SparkEnv utility is requested to create a SparkEnv (clientMode flag is turned on for executors and off for the driver)
                                                                                                                                                                                      • With clientMode flag true:

                                                                                                                                                                                        • CoarseGrainedExecutorBackend is requested to run
                                                                                                                                                                                        • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                        • Master (Spark Standalone) is requested to startRpcEnvAndEndpoint
                                                                                                                                                                                        • Worker (Spark Standalone) is requested to startRpcEnvAndEndpoint
                                                                                                                                                                                        • DriverWrapper is launched
                                                                                                                                                                                        • ApplicationMaster (Spark on YARN) is requested to runExecutorLauncher (in client deploy mode)
                                                                                                                                                                                      ","text":""},{"location":"rpc/RpcEnv/#default-endpoint-lookup-timeout","title":"Default Endpoint Lookup Timeout

                                                                                                                                                                                      RpcEnv uses the default lookup timeout for...FIXME

                                                                                                                                                                                      When a remote endpoint is resolved, a local RPC environment connects to the remote one (endpoint lookup). To configure the time needed for the endpoint lookup you can use the following settings.

                                                                                                                                                                                      It is a prioritized list of lookup timeout properties (the higher on the list, the more important):

                                                                                                                                                                                      • spark.rpc.lookupTimeout
                                                                                                                                                                                      • spark.network.timeout
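Reading these in priority order can be sketched as follows (Spark itself wraps this logic in an RpcTimeout; 120s is the documented default of spark.network.timeout):

import org.apache.spark.SparkConf

val conf = new SparkConf()
val lookupTimeout: String =
  conf.get("spark.rpc.lookupTimeout", conf.get("spark.network.timeout", "120s"))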
                                                                                                                                                                                      ","text":""},{"location":"rpc/RpcEnvConfig/","title":"RpcEnvConfig","text":"

RpcEnvConfig is a configuration of an RpcEnv:

• SparkConf
• System Name
• Bind Address
• Advertised Address
• Port
• SecurityManager
• Number of CPU cores
• clientMode flag

RpcEnvConfig is created when the RpcEnv utility is used to create an RpcEnv (using an RpcEnvFactory).
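For illustration, such a configuration amounts to the following (RpcEnvConfig is a private[spark] case class; the field names follow the list above and the address and port values are hypothetical):

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnvConfig

val conf = new SparkConf()
val config = RpcEnvConfig(
  conf = conf,
  name = "sparkDriver",
  bindAddress = "0.0.0.0",
  advertiseAddress = "10.0.0.5",
  port = 7078,
  securityManager = new SecurityManager(conf),
  numUsableCores = 0,
  clientMode = false)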

Client Mode

When an RPC Environment is initialized as part of the initialization of the driver or executors (using RpcEnv.create), clientMode is false for the driver and true for executors.

Copied (almost verbatim) from SPARK-10997 (Netty-based RPC env should support a \"client-only\" mode, https://issues.apache.org/jira/browse/SPARK-10997) and the commit https://github.com/apache/spark/commit/71d1c907dec446db566b19f912159fd8f46deb7d:

                                                                                                                                                                                        \"Client mode\" means the RPC env will not listen for incoming connections.

This allows certain processes in the Spark stack (such as Executors or the YARN client-mode AM) to act as pure clients when using the netty-based RPC backend, reducing the number of sockets Spark apps need to use and also the number of open ports.

The AM connects to the driver in \"client mode\", and that connection is used for all driver-to-AM communication, so the AM is properly notified when the connection goes down.

In the general, non-YARN case, the clientMode flag is therefore enabled for executors and disabled for the driver.

In Spark on YARN in client deploy mode, the clientMode flag is however enabled explicitly when Spark on YARN's ApplicationMaster creates the sparkYarnAM RPC Environment.

                                                                                                                                                                                        "},{"location":"rpc/RpcEnvFactory/","title":"RpcEnvFactory","text":"

RpcEnvFactory is an abstraction of factories to create an RpcEnv.

Available RpcEnvFactories

NettyRpcEnvFactory is the default and only known RpcEnvFactory in Apache Spark (as of commit https://github.com/apache/spark/commit/4f5a24d7e73104771f233af041eeba4f41675974).

Creating RpcEnv

                                                                                                                                                                                        "},{"location":"rpc/RpcEnvFactory/#sourcescala","title":"[source,scala]","text":"

create(\n  config: RpcEnvConfig): RpcEnv\n

                                                                                                                                                                                        create is used when RpcEnv utility is requested to rpc:RpcEnv.md#create[create an RpcEnv].

                                                                                                                                                                                        "},{"location":"rpc/RpcEnvFileServer/","title":"RpcEnvFileServer","text":"


                                                                                                                                                                                        RpcEnvFileServer is...FIXME

                                                                                                                                                                                        "},{"location":"rpc/RpcUtils/","title":"RpcUtils","text":""},{"location":"rpc/RpcUtils/#maximum-message-size","title":"Maximum Message Size
                                                                                                                                                                                        maxMessageSizeBytes(\n  conf: SparkConf): Int\n

maxMessageSizeBytes is the value of the spark.rpc.message.maxSize configuration property converted to bytes (the value is multiplied by 1024 * 1024), as sketched after the list below.

                                                                                                                                                                                        maxMessageSizeBytes throws an IllegalArgumentException when the value is above 2047 MB:

                                                                                                                                                                                        spark.rpc.message.maxSize should not be greater than 2047 MB\n

                                                                                                                                                                                        maxMessageSizeBytes is used when:

                                                                                                                                                                                        • MapOutputTrackerMaster is requested for the maxRpcMessageSize
                                                                                                                                                                                        • Executor is requested for the maxDirectResultSize
                                                                                                                                                                                        • CoarseGrainedSchedulerBackend is requested for the maxRpcMessageSize
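As referenced above, the conversion and the upper-bound check can be sketched as follows (a minimal re-implementation; 128 MB is the documented default of spark.rpc.message.maxSize):

import org.apache.spark.SparkConf

val MAX_MESSAGE_SIZE_IN_MB = 2047  // Int.MaxValue / 1024 / 1024, rounded down

def maxMessageSizeBytesSketch(conf: SparkConf): Int = {
  val maxSizeInMB = conf.getInt("spark.rpc.message.maxSize", 128)
  require(maxSizeInMB <= MAX_MESSAGE_SIZE_IN_MB,
    s"spark.rpc.message.maxSize should not be greater than $MAX_MESSAGE_SIZE_IN_MB MB")
  maxSizeInMB * 1024 * 1024
}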
                                                                                                                                                                                        ","text":""},{"location":"rpc/RpcUtils/#makedriverref","title":"makeDriverRef
                                                                                                                                                                                        makeDriverRef(\n  name: String,\n  conf: SparkConf,\n  rpcEnv: RpcEnv): RpcEndpointRef\n

                                                                                                                                                                                        makeDriverRef...FIXME
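A hedged sketch of what makeDriverRef amounts to: resolve the driver's RPC address from the configuration and look up the named endpoint on it (the exact configuration defaults and helper calls may differ):

import org.apache.spark.SparkConf
import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef, RpcEnv}

def makeDriverRefSketch(name: String, conf: SparkConf, rpcEnv: RpcEnv): RpcEndpointRef = {
  val driverHost = conf.get("spark.driver.host", "localhost")
  val driverPort = conf.getInt("spark.driver.port", 7077)
  rpcEnv.setupEndpointRef(RpcAddress(driverHost, driverPort), name)
}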

                                                                                                                                                                                        makeDriverRef is used when:

                                                                                                                                                                                        • BarrierTaskContext is created
                                                                                                                                                                                        • SparkEnv utility is used to create a SparkEnv (on executors)
                                                                                                                                                                                        • Executor is created
                                                                                                                                                                                        • PluginContextImpl is requested for driverEndpoint
                                                                                                                                                                                        ","text":""},{"location":"rpc/spark-rpc-netty/","title":"Netty-Based RpcEnv","text":"

Netty-based RPC Environment is created by NettyRpcEnvFactory when spark.rpc is netty or org.apache.spark.rpc.netty.NettyRpcEnvFactory.

NettyRpcEnv is only started on the driver.

                                                                                                                                                                                        The default port to listen to is 7077.

                                                                                                                                                                                        When NettyRpcEnv starts, the following INFO message is printed out in the logs:

                                                                                                                                                                                        Successfully started service 'NettyRpcEnv' on port 0.\n

                                                                                                                                                                                        == [[thread-pools]] Thread Pools

                                                                                                                                                                                        === shuffle-server-ID

The Shuffle server uses a daemon thread pool called shuffle-server-ID, where ID is a unique integer, for its EventLoopGroup (NioEventLoopGroup for NIO or EpollEventLoopGroup for EPOLL).

                                                                                                                                                                                        CAUTION: FIXME Review Netty's NioEventLoopGroup.

                                                                                                                                                                                        CAUTION: FIXME Where are SO_BACKLOG, SO_RCVBUF, SO_SNDBUF channel options used?

                                                                                                                                                                                        === dispatcher-event-loop-ID

                                                                                                                                                                                        NettyRpcEnv's Dispatcher uses the daemon fixed thread pool with <> threads.

                                                                                                                                                                                        Thread names are formatted as dispatcher-event-loop-ID, where ID is a unique, sequentially assigned integer.

                                                                                                                                                                                        It starts the message processing loop on all of the threads.

                                                                                                                                                                                        === netty-rpc-env-timeout

                                                                                                                                                                                        NettyRpcEnv uses the daemon single-thread scheduled thread pool netty-rpc-env-timeout.

                                                                                                                                                                                        \"netty-rpc-env-timeout\" #87 daemon prio=5 os_prio=31 tid=0x00007f887775a000 nid=0xc503 waiting on condition [0x0000000123397000]\n

                                                                                                                                                                                        === netty-rpc-connection-ID

                                                                                                                                                                                        NettyRpcEnv uses the daemon cached thread pool with up to <> threads.

                                                                                                                                                                                        Thread names are formatted as netty-rpc-connection-ID, where ID is a unique, sequentially assigned integer.

                                                                                                                                                                                        == [[settings]] Settings

                                                                                                                                                                                        The Netty-based implementation uses the following properties:

• spark.rpc.io.mode (default: NIO) - NIO or EPOLL for low-level IO. NIO is always available, while EPOLL is only available on Linux. NIO uses io.netty.channel.nio.NioEventLoopGroup while EPOLL uses io.netty.channel.epoll.EpollEventLoopGroup.
                                                                                                                                                                                        • spark.shuffle.io.numConnectionsPerPeer always equals 1
• spark.rpc.io.threads (default: 0; maximum: 8) - the number of threads to use for the Netty client and server thread pools.
  • spark.shuffle.io.serverThreads (default: the value of spark.rpc.io.threads)
  • spark.shuffle.io.clientThreads (default: the value of spark.rpc.io.threads)
                                                                                                                                                                                        • spark.rpc.netty.dispatcher.numThreads (default: the number of processors available to JVM)
                                                                                                                                                                                        • spark.rpc.connect.threads (default: 64) - used in cluster mode to communicate with a remote RPC endpoint
                                                                                                                                                                                        • spark.port.maxRetries (default: 16 or 100 for testing when spark.testing is set) controls the maximum number of binding attempts/retries to a port before giving up.
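
As a minimal configuration sketch (the values below are illustrative only, not recommendations), some of these properties can be set on a SparkConf before the SparkContext is created:

import org.apache.spark.SparkConf

// Illustrative values only; the defaults are usually fine.
val conf = new SparkConf()
  .set("spark.rpc.io.mode", "NIO")       // NIO or EPOLL (EPOLL is Linux-only)
  .set("spark.rpc.io.threads", "8")      // threads for the Netty client/server pools
  .set("spark.port.maxRetries", "16")    // binding attempts before giving up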

                                                                                                                                                                                        == [[endpoints]] Endpoints

• endpoint-verifier (RpcEndpointVerifier) - a rpc:RpcEndpoint.md[RpcEndpoint] for remote RpcEnvs to query whether an RpcEndpoint exists or not. It uses the Dispatcher, which keeps track of registered endpoints, and responds true/false to a CheckExistence message.

endpoint-verifier is used to check whether a given endpoint exists before the endpoint's reference is given back to clients.

                                                                                                                                                                                        One use case is when an spark-standalone.md#AppClient[AppClient connects to standalone Masters] before it registers the application it acts for.

                                                                                                                                                                                        CAUTION: FIXME Who'd like to use endpoint-verifier and how?

                                                                                                                                                                                        == Message Dispatcher

                                                                                                                                                                                        A message dispatcher is responsible for routing RPC messages to the appropriate endpoint(s).

                                                                                                                                                                                        It uses the daemon fixed thread pool dispatcher-event-loop with spark.rpc.netty.dispatcher.numThreads threads for dispatching messages.

                                                                                                                                                                                        \"dispatcher-event-loop-0\" #26 daemon prio=5 os_prio=31 tid=0x00007f8877153800 nid=0x7103 waiting on condition [0x000000011f78b000]\n
                                                                                                                                                                                        "},{"location":"scheduler/","title":"Spark Scheduler","text":"

                                                                                                                                                                                        Spark Scheduler is a core component of Apache Spark that is responsible for scheduling tasks for execution.

                                                                                                                                                                                        Spark Scheduler uses the high-level stage-oriented DAGScheduler and the low-level task-oriented TaskScheduler.

                                                                                                                                                                                        "},{"location":"scheduler/#stage-execution","title":"Stage Execution","text":"

                                                                                                                                                                                        Every partition of a Stage is transformed into a Task (ShuffleMapTask or ResultTask for ShuffleMapStage and ResultStage, respectively).

Submitting a stage can trigger the execution of a series of dependent parent stages.

When a Spark job is submitted, new stages are created (either from scratch or linked to, i.e. shared with, other jobs that already use them).

DAGScheduler splits up a job into a collection of Stages. A Stage contains a sequence of narrow transformations that can be completed without shuffling the data set, separated at shuffle boundaries (where a shuffle occurs). Stages are thus a result of breaking the RDD graph at shuffle boundaries.

                                                                                                                                                                                        Shuffle boundaries introduce a barrier where stages/tasks must wait for the previous stage to finish before they fetch map outputs.
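
As a minimal sketch (assuming an existing SparkContext sc and an illustrative input path), reduceByKey introduces a shuffle boundary, so the job below is split into a ShuffleMapStage (textFile, flatMap, map) and a ResultStage (reduceByKey, collect):

// Two stages: everything before the shuffle becomes a ShuffleMapStage,
// the part after the shuffle boundary becomes the ResultStage.
val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()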

                                                                                                                                                                                        "},{"location":"scheduler/#resources","title":"Resources","text":"
                                                                                                                                                                                        • Deep Dive into the Apache Spark Scheduler by Xingbo Jiang (Databricks)
                                                                                                                                                                                        "},{"location":"scheduler/ActiveJob/","title":"ActiveJob","text":"

                                                                                                                                                                                        ActiveJob (job, action job) is a top-level work item (computation) submitted to DAGScheduler for execution (usually to compute the result of an RDD action).

                                                                                                                                                                                        Executing a job is equivalent to computing the partitions of the RDD an action has been executed upon. The number of partitions (numPartitions) to compute in a job depends on the type of a stage (ResultStage or ShuffleMapStage).

                                                                                                                                                                                        A job starts with a single target RDD, but can ultimately include other RDDs that are all part of RDD lineage.

                                                                                                                                                                                        The parent stages are always ShuffleMapStages.

                                                                                                                                                                                        Note

Not all partitions always have to be computed for ResultStages (e.g. for actions like first() and lookup()).
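
A minimal sketch of that note (assuming an existing SparkContext sc): first() needs only one element, so its ResultStage typically runs a task on a single partition rather than on all of them.

val rdd = sc.parallelize(1 to 1000, numSlices = 8)
// Computes only as many partitions as needed (usually just the first one),
// unlike collect() which would compute all 8.
val head = rdd.first()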

                                                                                                                                                                                        "},{"location":"scheduler/ActiveJob/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                        ActiveJob takes the following to be created:

                                                                                                                                                                                        • Job ID
                                                                                                                                                                                        • Final Stage
                                                                                                                                                                                        • CallSite
                                                                                                                                                                                        • JobListener
                                                                                                                                                                                        • Properties

                                                                                                                                                                                          ActiveJob is created when:

                                                                                                                                                                                          • DAGScheduler is requested to handleJobSubmitted and handleMapStageSubmitted
                                                                                                                                                                                          "},{"location":"scheduler/ActiveJob/#final-stage","title":"Final Stage

ActiveJob is given a Stage when created that determines its logical type:

                                                                                                                                                                                          1. Map-Stage Job that computes the map output files for a ShuffleMapStage (for submitMapStage) before any downstream stages are submitted
                                                                                                                                                                                          2. Result job that computes a ResultStage to execute an action
                                                                                                                                                                                          ","text":""},{"location":"scheduler/ActiveJob/#finished-computed-partitions","title":"Finished (Computed) Partitions

                                                                                                                                                                                          ActiveJob uses finished registry of flags to track partitions that have already been computed (true) or not (false).

                                                                                                                                                                                          ","text":""},{"location":"scheduler/BlacklistTracker/","title":"BlacklistTracker","text":"

                                                                                                                                                                                          BlacklistTracker is...FIXME

                                                                                                                                                                                          "},{"location":"scheduler/CoarseGrainedSchedulerBackend/","title":"CoarseGrainedSchedulerBackend","text":"

                                                                                                                                                                                          CoarseGrainedSchedulerBackend is a base SchedulerBackend for coarse-grained schedulers.

                                                                                                                                                                                          CoarseGrainedSchedulerBackend is an ExecutorAllocationClient.

                                                                                                                                                                                          CoarseGrainedSchedulerBackend is responsible for requesting resources from a cluster manager for executors that it in turn uses to launch tasks (on CoarseGrainedExecutorBackend).

                                                                                                                                                                                          CoarseGrainedSchedulerBackend holds executors for the duration of the Spark job rather than relinquishing executors whenever a task is done and asking the scheduler to launch a new executor for each new task.

                                                                                                                                                                                          CoarseGrainedSchedulerBackend registers CoarseGrainedScheduler RPC Endpoint that executors use for RPC communication.

                                                                                                                                                                                          Note

                                                                                                                                                                                          Active executors are executors that are not pending to be removed or lost.

                                                                                                                                                                                          "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#implementations","title":"Implementations","text":"
                                                                                                                                                                                          • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                                                                                                                                                          • MesosCoarseGrainedSchedulerBackend (Spark on Mesos)
                                                                                                                                                                                          • StandaloneSchedulerBackend (Spark Standalone)
                                                                                                                                                                                          • YarnSchedulerBackend (Spark on YARN)
                                                                                                                                                                                          "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                          CoarseGrainedSchedulerBackend takes the following to be created:

                                                                                                                                                                                          • TaskSchedulerImpl
                                                                                                                                                                                          • RpcEnv"},{"location":"scheduler/CoarseGrainedSchedulerBackend/#driverEndpoint","title":"CoarseGrainedScheduler RPC Endpoint","text":"
                                                                                                                                                                                            driverEndpoint: RpcEndpointRef\n

                                                                                                                                                                                            CoarseGrainedSchedulerBackend registers a DriverEndpoint RPC endpoint known as CoarseGrainedScheduler when created.

                                                                                                                                                                                            "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#createDriverEndpoint","title":"Creating DriverEndpoint","text":"
                                                                                                                                                                                            createDriverEndpoint(): DriverEndpoint\n

                                                                                                                                                                                            createDriverEndpoint creates a new DriverEndpoint.

                                                                                                                                                                                            Note

The purpose of createDriverEndpoint is to let CoarseGrainedSchedulerBackends provide their own custom implementations:

                                                                                                                                                                                            • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                                                                                                                                                            • StandaloneSchedulerBackend

                                                                                                                                                                                            createDriverEndpoint is used when:

                                                                                                                                                                                            • CoarseGrainedSchedulerBackend is created (and registers CoarseGrainedScheduler RPC endpoint)
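
As a rough sketch of the extension point mentioned in the note above (CoarseGrainedSchedulerBackend and DriverEndpoint are Spark-internal, so such a subclass lives inside Spark's own scheduler packages; CustomSchedulerBackend and CustomDriverEndpoint are hypothetical names):

// Sketch only: mirrors what e.g. KubernetesClusterSchedulerBackend does with its
// own DriverEndpoint subclass; it does not compile from user code.
class CustomSchedulerBackend(scheduler: TaskSchedulerImpl, rpcEnv: RpcEnv)
  extends CoarseGrainedSchedulerBackend(scheduler, rpcEnv) {

  // Specialized endpoint, e.g. to react to executor registration or disconnection
  private class CustomDriverEndpoint extends DriverEndpoint

  override def createDriverEndpoint(): DriverEndpoint = new CustomDriverEndpoint()
}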
                                                                                                                                                                                            "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks","text":"SchedulerBackend
                                                                                                                                                                                            maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                                                                                                            maxNumConcurrentTasks is part of the SchedulerBackend abstraction.

                                                                                                                                                                                            maxNumConcurrentTasks uses the Available Executors registry to find out about available ResourceProfiles, total number of CPU cores and ExecutorResourceInfos of every active executor.

                                                                                                                                                                                            In the end, maxNumConcurrentTasks calculates the available (parallel) slots for the given ResourceProfile (and given the available executor resources).
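
A simplified, self-contained sketch of that slot calculation (the names below are illustrative, not Spark's internal ones): each active executor can run as many concurrent tasks as its scarcest resource allows, and the per-executor slots are summed.

final case class ExecutorResources(cores: Int, resources: Map[String, Int])

def availableSlots(
    executors: Seq[ExecutorResources],
    cpusPerTask: Int,
    resourcesPerTask: Map[String, Int]): Int =
  executors.map { e =>
    val byCpu = e.cores / cpusPerTask
    val byOtherResources = resourcesPerTask.map { case (name, amountPerTask) =>
      e.resources.getOrElse(name, 0) / amountPerTask
    }
    (byCpu +: byOtherResources.toSeq).min
  }.sum

// 2 executors with 8 cores and 1 GPU each; tasks need 2 CPUs and 1 GPU => 1 slot each, 2 in total
availableSlots(
  Seq(ExecutorResources(8, Map("gpu" -> 1)), ExecutorResources(8, Map("gpu" -> 1))),
  cpusPerTask = 2,
  resourcesPerTask = Map("gpu" -> 1))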

                                                                                                                                                                                            "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#totalregisteredexecutors-registry","title":"totalRegisteredExecutors Registry
                                                                                                                                                                                            totalRegisteredExecutors: AtomicInteger\n

                                                                                                                                                                                            totalRegisteredExecutors is an internal registry of the number of registered executors (a Java AtomicInteger).

                                                                                                                                                                                            totalRegisteredExecutors starts from 0.

                                                                                                                                                                                            totalRegisteredExecutors is incremented when:

                                                                                                                                                                                            • DriverEndpoint is requested to handle a RegisterExecutor message

                                                                                                                                                                                            totalRegisteredExecutors is decremented when:

                                                                                                                                                                                            • DriverEndpoint is requested to remove an executor
                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#sufficient-resources-registered","title":"Sufficient Resources Registered
                                                                                                                                                                                            sufficientResourcesRegistered(): Boolean\n

sufficientResourcesRegistered is true (and is supposed to be overridden by custom CoarseGrainedSchedulerBackends).

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#minimum-resources-available-ratio","title":"Minimum Resources Available Ratio
                                                                                                                                                                                            minRegisteredRatio: Double\n

                                                                                                                                                                                            minRegisteredRatio is a ratio of the minimum resources available to the total expected resources for the CoarseGrainedSchedulerBackend to be ready for scheduling tasks (for execution).

                                                                                                                                                                                            minRegisteredRatio uses spark.scheduler.minRegisteredResourcesRatio configuration property if defined or defaults to 0.0.

                                                                                                                                                                                            minRegisteredRatio can be between 0.0 and 1.0 (inclusive).

                                                                                                                                                                                            minRegisteredRatio is used when:

                                                                                                                                                                                            • CoarseGrainedSchedulerBackend is requested to isReady
                                                                                                                                                                                            • StandaloneSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                            • KubernetesClusterSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                            • MesosCoarseGrainedSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                            • YarnSchedulerBackend is requested to sufficientResourcesRegistered
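
A minimal configuration sketch (the values are illustrative): require 80% of the expected resources before task scheduling starts, but wait at most 30 seconds for them.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")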
                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#available-executors-registry","title":"Available Executors Registry
                                                                                                                                                                                            executorDataMap: HashMap[String, ExecutorData]\n

                                                                                                                                                                                            CoarseGrainedSchedulerBackend tracks available executors using executorDataMap registry (of ExecutorDatas by executor id).

                                                                                                                                                                                            A new entry is added when DriverEndpoint is requested to handle RegisterExecutor message.

                                                                                                                                                                                            An entry is removed when DriverEndpoint is requested to handle RemoveExecutor message or a remote host (with one or many executors) disconnects.

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#revive-messages-scheduler-service","title":"Revive Messages Scheduler Service
                                                                                                                                                                                            reviveThread: ScheduledExecutorService\n

                                                                                                                                                                                            CoarseGrainedSchedulerBackend creates a Java ScheduledExecutorService when created.

                                                                                                                                                                                            The ScheduledExecutorService is used by DriverEndpoint RPC Endpoint to post ReviveOffers messages regularly.
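
A self-contained sketch of the pattern (not Spark's actual code): a single-threaded scheduled executor periodically posts a revive action, analogous to DriverEndpoint sending itself ReviveOffers at a fixed interval (spark.scheduler.revive.interval, 1s by default).

import java.util.concurrent.{Executors, TimeUnit}

val reviveThread = Executors.newSingleThreadScheduledExecutor()
val reviveIntervalMs = 1000L  // illustrative stand-in for spark.scheduler.revive.interval

reviveThread.scheduleAtFixedRate(
  new Runnable { def run(): Unit = println("ReviveOffers") },  // stand-in for self.send(ReviveOffers)
  0L, reviveIntervalMs, TimeUnit.MILLISECONDS)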

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#maximum-size-of-rpc-message","title":"Maximum Size of RPC Message

                                                                                                                                                                                            maxRpcMessageSize is the value of spark.rpc.message.maxSize configuration property.

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#making-fake-resource-offers-on-executors","title":"Making Fake Resource Offers on Executors
                                                                                                                                                                                            makeOffers(): Unit\nmakeOffers(\n  executorId: String): Unit\n

                                                                                                                                                                                            makeOffers takes the active executors (out of the <> internal registry) and creates WorkerOffer resource offers for each (one per executor with the executor's id, host and free cores).

                                                                                                                                                                                            CAUTION: Only free cores are considered in making offers. Memory is not! Why?!

                                                                                                                                                                                            It then requests TaskSchedulerImpl.md#resourceOffers[TaskSchedulerImpl to process the resource offers] to create a collection of TaskDescription collections that it in turn uses to launch tasks.
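
A simplified, self-contained sketch of that step (the names are illustrative, not Spark's internal WorkerOffer/ExecutorData types): one offer per active executor, carrying its id, host and free cores.

final case class Offer(executorId: String, host: String, freeCores: Int)

def makeOffers(activeExecutors: Map[String, (String, Int)]): Seq[Offer] =
  activeExecutors.toSeq.map { case (id, (host, freeCores)) =>
    Offer(id, host, freeCores)
  }

// Two registered executors with 4 and 2 free cores
makeOffers(Map("0" -> ("host-a", 4), "1" -> ("host-b", 2)))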

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#getting-executor-ids","title":"Getting Executor Ids

                                                                                                                                                                                            When called, getExecutorIds simply returns executor ids from the internal <> registry.

                                                                                                                                                                                            NOTE: It is called when SparkContext.md#getExecutorIds[SparkContext calculates executor ids].

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-executors","title":"Requesting Executors
                                                                                                                                                                                            requestExecutors(\n  numAdditionalExecutors: Int): Boolean\n

                                                                                                                                                                                            requestExecutors is a \"decorator\" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).

                                                                                                                                                                                            requestExecutors method is part of the ExecutorAllocationClient abstraction.

When called, you should see the following INFO message followed by a DEBUG message in the logs:

                                                                                                                                                                                            Requesting [numAdditionalExecutors] additional executor(s) from the cluster manager\nNumber of pending executors is now [numPendingExecutors]\n

                                                                                                                                                                                            <> is increased by the input numAdditionalExecutors.

                                                                                                                                                                                            requestExecutors requests executors from a cluster manager (that reflects the current computation needs). The \"new executor total\" is a sum of the internal <> and <> decreased by the <>.

If numAdditionalExecutors is negative, an IllegalArgumentException is thrown:

                                                                                                                                                                                            Attempted to request a negative number of additional executor(s) [numAdditionalExecutors] from the cluster manager. Please specify a positive number!\n

NOTE: It is a final method that no other scheduler backends can customize further.

                                                                                                                                                                                            NOTE: The method is a synchronized block that makes multiple concurrent requests be handled in a serial fashion, i.e. one by one.
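
A minimal usage sketch (assuming an existing SparkContext sc running against a coarse-grained cluster manager; SparkContext.requestExecutors is a developer API that routes the request to this backend method and is not supported in local mode):

// Ask the cluster manager for two additional executors; the Boolean only tells
// whether the request was acknowledged, not whether the executors were granted.
val acknowledged: Boolean = sc.requestExecutors(2)
if (!acknowledged) println("Cluster manager did not acknowledge the request")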

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-exact-number-of-executors","title":"Requesting Exact Number of Executors
                                                                                                                                                                                            requestTotalExecutors(\n  numExecutors: Int,\n  localityAwareTasks: Int,\n  hostToLocalTaskCount: Map[String, Int]): Boolean\n

                                                                                                                                                                                            requestTotalExecutors is a \"decorator\" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).

                                                                                                                                                                                            requestTotalExecutors is part of the ExecutorAllocationClient abstraction.

                                                                                                                                                                                            It sets the internal <> and <> registries. It then calculates the exact number of executors which is the input numExecutors and the <> decreased by the number of <>.

If numExecutors is negative, an IllegalArgumentException is thrown:

                                                                                                                                                                                            Attempted to request a negative number of executor(s) [numExecutors] from the cluster manager. Please specify a positive number!\n

NOTE: It is a final method that no other scheduler backends can customize further.

                                                                                                                                                                                            NOTE: The method is a synchronized block that makes multiple concurrent requests be handled in a serial fashion, i.e. one by one.
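
A minimal usage sketch of the exact-total variant via the developer API on SparkContext (again assuming an existing SparkContext sc on a coarse-grained cluster manager; locality hints are left empty here):

// Request that the cluster maintain exactly 4 executors in total:
// arguments are (numExecutors, localityAwareTasks, hostToLocalTaskCount).
val acknowledged: Boolean = sc.requestTotalExecutors(4, 0, Map.empty[String, Int])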

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#finding-default-level-of-parallelism","title":"Finding Default Level of Parallelism
                                                                                                                                                                                            defaultParallelism(): Int\n

                                                                                                                                                                                            defaultParallelism is part of the SchedulerBackend abstraction.

defaultParallelism is the value of the spark.default.parallelism configuration property, if defined.

Otherwise, defaultParallelism is the maximum of totalCoreCount and 2.
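
A simplified, self-contained sketch of that rule (totalCoreCount stands in for the backend's internal counter of registered cores; the helper is illustrative, not Spark's code):

def defaultParallelism(conf: Map[String, String], totalCoreCount: Int): Int =
  conf.get("spark.default.parallelism").map(_.toInt)
    .getOrElse(math.max(totalCoreCount, 2))

defaultParallelism(Map.empty, totalCoreCount = 6)                                   // 6
defaultParallelism(Map("spark.default.parallelism" -> "10"), totalCoreCount = 6)    // 10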

                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#killing-task","title":"Killing Task
                                                                                                                                                                                            killTask(\n  taskId: Long,\n  executorId: String,\n  interruptThread: Boolean): Unit\n

                                                                                                                                                                                            killTask is part of the SchedulerBackend abstraction.

                                                                                                                                                                                            killTask simply sends a KillTask message to <>.","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#stopping-all-executors","title":"Stopping All Executors

                                                                                                                                                                                            stopExecutors sends a blocking <> message to <> (if already initialized).

                                                                                                                                                                                            NOTE: It is called exclusively while CoarseGrainedSchedulerBackend is <>.

                                                                                                                                                                                            You should see the following INFO message in the logs:

                                                                                                                                                                                            Shutting down all executors\n
                                                                                                                                                                                            ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#reset-state","title":"Reset State

                                                                                                                                                                                            reset resets the internal state:

                                                                                                                                                                                            1. Sets <> to 0
                                                                                                                                                                                            2. Clears executorsPendingToRemove
                                                                                                                                                                                            3. Sends a blocking <> message to <> for every executor (in the internal executorDataMap) to inform it about SlaveLost with the message: +
                                                                                                                                                                                              Stale executor after cluster manager re-registered.\n

reset is a method that is defined in CoarseGrainedSchedulerBackend, but used and overridden exclusively by yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#remove-executor","title":"Remove Executor
                                                                                                                                                                                              removeExecutor(executorId: String, reason: ExecutorLossReason)\n

                                                                                                                                                                                              removeExecutor sends a blocking <> message to <>.

                                                                                                                                                                                              NOTE: It is called by subclasses spark-standalone.md#SparkDeploySchedulerBackend[SparkDeploySchedulerBackend], spark-mesos/spark-mesos.md#CoarseMesosSchedulerBackend[CoarseMesosSchedulerBackend], and yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#coarsegrainedscheduler-rpc-endpoint","title":"CoarseGrainedScheduler RPC Endpoint

                                                                                                                                                                                              When <>, it registers CoarseGrainedScheduler RPC endpoint to be the driver's communication endpoint.

                                                                                                                                                                                              driverEndpoint is a DriverEndpoint.

                                                                                                                                                                                              Note

CoarseGrainedSchedulerBackend is created while SparkContext is being created, which in turn happens inside a Spark driver. That explains the name driverEndpoint (at least partially).

Internally, it is referred to as the standalone scheduler's driver endpoint.

                                                                                                                                                                                              It tracks:

                                                                                                                                                                                              It uses driver-revive-thread daemon single-thread thread pool for ...FIXME

                                                                                                                                                                                              CAUTION: FIXME A potential issue with driverEndpoint.asInstanceOf[NettyRpcEndpointRef].toURI - doubles spark:// prefix.

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#starting-coarsegrainedschedulerbackend","title":"Starting CoarseGrainedSchedulerBackend
                                                                                                                                                                                              start(): Unit\n

                                                                                                                                                                                              start is part of the SchedulerBackend abstraction.

                                                                                                                                                                                              start takes all spark.-prefixed properties and registers the <CoarseGrainedScheduler RPC endpoint>> (backed by DriverEndpoint ThreadSafeRpcEndpoint).

                                                                                                                                                                                              NOTE: start uses <> to access the current SparkContext.md[SparkContext] and in turn SparkConf.md[SparkConf].

                                                                                                                                                                                              NOTE: start uses <> that was given when <CoarseGrainedSchedulerBackend was created>>.","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#checking-if-sufficient-compute-resources-available-or-waiting-time-passedmethod","title":"Checking If Sufficient Compute Resources Available Or Waiting Time PassedMethod

                                                                                                                                                                                              isReady(): Boolean\n

                                                                                                                                                                                              isReady is part of the SchedulerBackend abstraction.

isReady allows delaying task launching until <> or <> passes.
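
A simplified, self-contained sketch of that decision (the parameter names are illustrative, not Spark's internal fields):

def isReady(
    sufficientResourcesRegistered: Boolean,
    elapsedSinceStartMs: Long,
    maxRegisteredWaitingTimeMs: Long): Boolean =
  sufficientResourcesRegistered || elapsedSinceStartMs >= maxRegisteredWaitingTimeMs

// Not enough resources yet, but the waiting time has elapsed => ready anyway
isReady(sufficientResourcesRegistered = false, elapsedSinceStartMs = 35000, maxRegisteredWaitingTimeMs = 30000)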

                                                                                                                                                                                              Internally, isReady <>.

                                                                                                                                                                                              NOTE: <> by default responds that sufficient resources are available.

                                                                                                                                                                                              If the <>, you should see the following INFO message in the logs and isReady is positive.

                                                                                                                                                                                              SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: [minRegisteredRatio]\n

If there are not sufficient resources available yet (the above requirement does not hold), isReady checks whether the time since the backend was started has exceeded spark.scheduler.maxRegisteredResourcesWaitingTime, to give a way to launch tasks even when the minimum registered-resources ratio has not been reached yet.

                                                                                                                                                                                              You should see the following INFO message in the logs and isReady is positive.

                                                                                                                                                                                              SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: [maxRegisteredWaitingTimeMs](ms)\n

Otherwise, when there are not sufficient resources registered and the waiting time has not elapsed, isReady is negative.","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#reviving-resource-offers","title":"Reviving Resource Offers

                                                                                                                                                                                              reviveOffers(): Unit\n

                                                                                                                                                                                              reviveOffers is part of the SchedulerBackend abstraction.

                                                                                                                                                                                              reviveOffers simply sends a ReviveOffers message to CoarseGrainedSchedulerBackend RPC endpoint.
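In code this boils down to a one-way send, roughly as follows (a sketch; driverEndpoint is the RpcEndpointRef of the CoarseGrainedScheduler endpoint):

// fire-and-forget: no reply is expected for ReviveOffers
driverEndpoint.send(ReviveOffers)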

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#stopping-schedulerbackend","title":"Stopping SchedulerBackend
                                                                                                                                                                                              stop(): Unit\n

                                                                                                                                                                                              stop is part of the SchedulerBackend abstraction.

stop stops all executors and the CoarseGrainedScheduler RPC endpoint (by sending a blocking StopDriver message).

                                                                                                                                                                                              In case of any Exception, stop reports a SparkException with the message:

                                                                                                                                                                                              Error stopping standalone scheduler's driver endpoint\n
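A rough sketch of that shutdown and error handling (names as described above, not verbatim source):

try {
  if (driverEndpoint != null) {
    // blocking request-reply: wait until the endpoint confirms it has stopped
    driverEndpoint.askSync[Boolean](StopDriver)
  }
} catch {
  case e: Exception =>
    throw new SparkException(
      "Error stopping standalone scheduler's driver endpoint", e)
}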
                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#createdriverendpointref","title":"createDriverEndpointRef
                                                                                                                                                                                              createDriverEndpointRef(\n  properties: ArrayBuffer[(String, String)]): RpcEndpointRef\n

createDriverEndpointRef creates a DriverEndpoint and requests the RpcEnv to register it as CoarseGrainedScheduler.
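A sketch of the registration, assuming an RpcEnv and a DriverEndpoint factory method are in scope (createDriverEndpoint is an assumed name):

// the well-known endpoint name used by executors to reach the driver
val endpointName = "CoarseGrainedScheduler"
val driverEndpoint: RpcEndpointRef =
  rpcEnv.setupEndpoint(endpointName, createDriverEndpoint(properties))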

createDriverEndpointRef is used when CoarseGrainedSchedulerBackend is requested to start.","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#checking-whether-executor-is-active","title":"Checking Whether Executor is Active

                                                                                                                                                                                              isExecutorActive(\n  id: String): Boolean\n

                                                                                                                                                                                              isExecutorActive is part of the ExecutorAllocationClient abstraction.

                                                                                                                                                                                              isExecutorActive...FIXME

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-executors-from-cluster-manager","title":"Requesting Executors from Cluster Manager
                                                                                                                                                                                              doRequestTotalExecutors(\n  requestedTotal: Int): Future[Boolean]\n

doRequestTotalExecutors returns an already-completed Future with the value false.
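The default behavior amounts to the following (a sketch of the base-class shape):

import scala.concurrent.Future

// base implementation: requesting executors is not supported, so answer immediately with false
protected def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] =
  Future.successful(false)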

                                                                                                                                                                                              doRequestTotalExecutors is used when:

                                                                                                                                                                                              • CoarseGrainedSchedulerBackend is requested to requestExecutors, requestTotalExecutors and killExecutors
                                                                                                                                                                                              ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#logging","title":"Logging

                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend logger to see what happens inside.

                                                                                                                                                                                              Add the following line to conf/log4j.properties:

                                                                                                                                                                                              log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend=ALL\n

                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                              ","text":""},{"location":"scheduler/CompressedMapStatus/","title":"CompressedMapStatus","text":"

                                                                                                                                                                                              CompressedMapStatus is...FIXME

                                                                                                                                                                                              "},{"location":"scheduler/DAGScheduler/","title":"DAGScheduler","text":""},{"location":"scheduler/DAGScheduler/#dagscheduler","title":"DAGScheduler","text":"

                                                                                                                                                                                              Note

The introduction that follows was highly influenced by the scaladoc of org.apache.spark.scheduler.DAGScheduler. As DAGScheduler is a private class, it does not appear in the official API documentation. You are strongly encouraged to read the sources first and only then this and the related pages.

                                                                                                                                                                                              "},{"location":"scheduler/DAGScheduler/#introduction","title":"Introduction","text":"

                                                                                                                                                                                              DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling using Jobs and Stages.

                                                                                                                                                                                              DAGScheduler transforms a logical execution plan (RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

After an action has been called on an RDD, SparkContext hands over a logical plan to DAGScheduler, which in turn translates it into a set of stages that are submitted as TaskSets for execution.
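For example (a user-level illustration, assuming an existing SparkContext sc), the following action triggers a job with two stages: a ShuffleMapStage for the reduceByKey shuffle and a ResultStage for collect.

val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect()   // the action that hands the logical plan over to DAGScheduler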

                                                                                                                                                                                              DAGScheduler works solely on the driver and is created as part of SparkContext's initialization (right after TaskScheduler and SchedulerBackend are ready).

                                                                                                                                                                                              DAGScheduler does three things in Spark:

                                                                                                                                                                                              • Computes an execution DAG (DAG of stages) for a job
                                                                                                                                                                                              • Determines the preferred locations to run each task on
                                                                                                                                                                                              • Handles failures due to shuffle output files being lost

                                                                                                                                                                                              DAGScheduler computes a directed acyclic graph (DAG) of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run jobs. It then submits stages to TaskScheduler.

                                                                                                                                                                                              In addition to coming up with the execution DAG, DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes the information to TaskScheduler.

                                                                                                                                                                                              DAGScheduler tracks which RDDs are cached (or persisted) to avoid \"recomputing\" them (re-doing the map side of a shuffle). DAGScheduler remembers what ShuffleMapStages have already produced output files (that are stored in BlockManagers).

                                                                                                                                                                                              DAGScheduler is only interested in cache location coordinates (i.e. host and executor id, per partition of a RDD).

                                                                                                                                                                                              Furthermore, DAGScheduler handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler itself, which will retry each task a small number of times before cancelling the whole stage.

                                                                                                                                                                                              DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially. See the section Event Bus.

                                                                                                                                                                                              DAGScheduler runs stages in topological order.

                                                                                                                                                                                              DAGScheduler uses SparkContext, TaskScheduler, LiveListenerBus, MapOutputTracker and BlockManager for its services. However, at the very minimum, DAGScheduler takes a SparkContext only (and requests SparkContext for the other services).

                                                                                                                                                                                              When DAGScheduler schedules a job as a result of executing an action on a RDD or calling SparkContext.runJob directly, it spawns parallel tasks to compute (partial) results per partition.

                                                                                                                                                                                              "},{"location":"scheduler/DAGScheduler/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                              DAGScheduler takes the following to be created:

                                                                                                                                                                                              • SparkContext
                                                                                                                                                                                              • TaskScheduler
                                                                                                                                                                                              • LiveListenerBus
                                                                                                                                                                                              • MapOutputTrackerMaster
                                                                                                                                                                                              • BlockManagerMaster
                                                                                                                                                                                              • SparkEnv
                                                                                                                                                                                              • Clock

DAGScheduler is created when SparkContext is created.

                                                                                                                                                                                                While being created, DAGScheduler requests the TaskScheduler to associate itself with and requests DAGScheduler Event Bus to start accepting events.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#submitMapStage","title":"Submitting MapStage for Execution (Posting MapStageSubmitted)","text":"
                                                                                                                                                                                                submitMapStage[K, V, C](\n  dependency: ShuffleDependency[K, V, C],\n  callback: MapOutputStatistics => Unit,\n  callSite: CallSite,\n  properties: Properties): JobWaiter[MapOutputStatistics]\n

                                                                                                                                                                                                submitMapStage requests the given ShuffleDependency for the RDD.

                                                                                                                                                                                                submitMapStage gets the job ID and increments it (for future submissions).

                                                                                                                                                                                                submitMapStage creates a JobWaiter to wait for a MapOutputStatistics. The JobWaiter waits for 1 task and, when completed successfully, executes the given callback function with the computed MapOutputStatistics.

                                                                                                                                                                                                In the end, submitMapStage posts a MapStageSubmitted and returns the JobWaiter.
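A sketch of that wiring (parameter and field names are approximate):

// one "task" to wait for: the MapOutputStatistics of the whole map stage
val waiter = new JobWaiter[MapOutputStatistics](
  this, jobId, totalTasks = 1,
  (_: Int, stats: MapOutputStatistics) => callback(stats))
eventProcessLoop.post(
  MapStageSubmitted(jobId, dependency, callSite, waiter, properties))
waiter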

                                                                                                                                                                                                Used when:

                                                                                                                                                                                                • SparkContext is requested to submit a MapStage for execution
                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#metricsSource","title":"DAGSchedulerSource

                                                                                                                                                                                                DAGScheduler uses DAGSchedulerSource for performance metrics.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#eventProcessLoop","title":"DAGScheduler Event Bus

                                                                                                                                                                                                DAGScheduler uses an event bus to process scheduling events on a separate thread (one by one and asynchronously).

                                                                                                                                                                                                DAGScheduler requests the event bus to start right when created and stops it when requested to stop.

                                                                                                                                                                                                DAGScheduler defines event-posting methods for posting DAGSchedulerEvent events to the event bus.
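The pattern is essentially a single-consumer event loop. A generic, simplified sketch (not Spark's own EventLoop class):

import java.util.concurrent.LinkedBlockingQueue

class SimpleEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  private val consumer = new Thread(name) {
    override def run(): Unit =
      try { while (true) handle(queue.take()) }       // process posted events one by one
      catch { case _: InterruptedException => () }    // stop() interrupts to end the loop
  }
  consumer.setDaemon(true)

  def start(): Unit = consumer.start()
  def post(event: E): Unit = queue.put(event)         // asynchronous: returns immediately
  def stop(): Unit = consumer.interrupt()
}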

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#taskScheduler","title":"TaskScheduler

                                                                                                                                                                                                DAGScheduler is given a TaskScheduler when created.

                                                                                                                                                                                                TaskScheduler is used for the following:

                                                                                                                                                                                                • Submitting missing tasks of a stage
                                                                                                                                                                                                • Handling task completion (CompletionEvent)
                                                                                                                                                                                                • Killing a task
                                                                                                                                                                                                • Failing a job and all other independent single-job stages
                                                                                                                                                                                                • Stopping itself
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#runJob","title":"Running Job
                                                                                                                                                                                                runJob[T, U](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int],\n  callSite: CallSite,\n  resultHandler: (Int, U) => Unit,\n  properties: Properties): Unit\n

                                                                                                                                                                                                runJob submits a job and waits until a result is available.

                                                                                                                                                                                                runJob prints out the following INFO message to the logs when the job has finished successfully:

                                                                                                                                                                                                Job [jobId] finished: [callSite], took [time] s\n

                                                                                                                                                                                                runJob prints out the following INFO message to the logs when the job has failed:

                                                                                                                                                                                                Job [jobId] failed: [callSite], took [time] s\n
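Internally, runJob delegates to submitJob and blocks on the returned JobWaiter (a condensed sketch; timing and logging omitted):

import scala.concurrent.duration.Duration
import org.apache.spark.util.ThreadUtils

val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
// block the calling thread until the JobWaiter's future completes (success or failure)
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)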

                                                                                                                                                                                                runJob is used when:

                                                                                                                                                                                                • SparkContext is requested to run a job
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#submitJob","title":"Submitting Job
                                                                                                                                                                                                submitJob[T, U](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int],\n  callSite: CallSite,\n  resultHandler: (Int, U) => Unit,\n  properties: Properties): JobWaiter[U]\n

                                                                                                                                                                                                submitJob increments the nextJobId internal counter.

                                                                                                                                                                                                submitJob creates a JobWaiter for the (number of) partitions and the given resultHandler function.

                                                                                                                                                                                                submitJob requests the DAGSchedulerEventProcessLoop to post a JobSubmitted.

                                                                                                                                                                                                In the end, submitJob returns the JobWaiter.

                                                                                                                                                                                                For empty partitions (no partitions to compute), submitJob requests the LiveListenerBus to post a SparkListenerJobStart and SparkListenerJobEnd (with JobSucceeded result marker) events and returns a JobWaiter with no tasks to wait for.

submitJob throws an IllegalArgumentException when the partition indices are not among the partitions of the given RDD:

                                                                                                                                                                                                Attempting to access a non-existent partition: [p]. Total number of partitions: [maxPartitions]\n
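A sketch of that validation (approximating the real check):

val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
  throw new IllegalArgumentException(
    s"Attempting to access a non-existent partition: $p. " +
      s"Total number of partitions: $maxPartitions")
}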

                                                                                                                                                                                                submitJob is used when:

                                                                                                                                                                                                • SparkContext is requested to submit a job
                                                                                                                                                                                                • DAGScheduler is requested to run a job
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#cacheLocs","title":"Partition Placement Preferences

                                                                                                                                                                                                DAGScheduler keeps track of block locations per RDD and partition.

                                                                                                                                                                                                DAGScheduler uses TaskLocation that includes a host name and an executor id on that host (as ExecutorCacheTaskLocation).

                                                                                                                                                                                                The keys are RDDs (their ids) and the values are arrays indexed by partition numbers.

                                                                                                                                                                                                Each entry is a set of block locations where a RDD partition is cached, i.e. the BlockManagers of the blocks.

                                                                                                                                                                                                Initialized empty when DAGScheduler is created.

                                                                                                                                                                                                Used when DAGScheduler is requested for the locations of the cache blocks of a RDD.
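The registry's shape can be pictured as follows (a simplified sketch of the cacheLocs field):

import scala.collection.mutable
import org.apache.spark.scheduler.TaskLocation

// RDD id -> one entry per partition, each being the locations where that partition is cached
val cacheLocs = new mutable.HashMap[Int, IndexedSeq[Seq[TaskLocation]]]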

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#activeJobs","title":"ActiveJobs

                                                                                                                                                                                                DAGScheduler tracks ActiveJobs:

                                                                                                                                                                                                • Adds a new ActiveJob when requested to handle JobSubmitted or MapStageSubmitted events

                                                                                                                                                                                                • Removes an ActiveJob when requested to clean up after an ActiveJob and independent stages.

                                                                                                                                                                                                • Removes all ActiveJobs when requested to doCancelAllJobs.

                                                                                                                                                                                                DAGScheduler uses ActiveJobs registry when requested to handle JobGroupCancelled or TaskCompletion events, to cleanUpAfterSchedulerStop and to abort a stage.

                                                                                                                                                                                                The number of ActiveJobs is available using job.activeJobs performance metric.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#createResultStage","title":"Creating ResultStage for RDD
                                                                                                                                                                                                createResultStage(\n  rdd: RDD[_],\n  func: (TaskContext, Iterator[_]) => _,\n  partitions: Array[Int],\n  jobId: Int,\n  callSite: CallSite): ResultStage\n

                                                                                                                                                                                                createResultStage creates a new ResultStage for the ShuffleDependencies and ResourceProfiles of the given RDD.

                                                                                                                                                                                                createResultStage finds the ShuffleDependencies and ResourceProfiles for the given RDD.

createResultStage merges the ResourceProfiles for the Stage (if merging is enabled) or reports an exception otherwise.

                                                                                                                                                                                                createResultStage does the following checks (that may report violations and break the execution):

                                                                                                                                                                                                • checkBarrierStageWithDynamicAllocation
                                                                                                                                                                                                • checkBarrierStageWithNumSlots
                                                                                                                                                                                                • checkBarrierStageWithRDDChainPattern

createResultStage getOrCreateParentStages (with the ShuffleDependencies and the given jobId).

                                                                                                                                                                                                createResultStage uses the nextStageId counter for a stage ID.

                                                                                                                                                                                                createResultStage creates a new ResultStage (with the unique id of a ResourceProfile among others).

                                                                                                                                                                                                createResultStage registers the ResultStage with the stage ID in stageIdToStage.

                                                                                                                                                                                                createResultStage updateJobIdStageIdMaps and returns the ResultStage.
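Put together, the flow roughly looks like this (a condensed sketch; helper and field names follow the steps above and are approximate):

val parents = getOrCreateParentStages(shuffleDeps, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
  callSite, resourceProfile.id)
stageIdToStage(id) = stage     // register the new stage under its stage ID
updateJobIdStageIdMaps(jobId, stage)
stage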

                                                                                                                                                                                                createResultStage is used when:

                                                                                                                                                                                                • DAGScheduler is requested to handle a JobSubmitted event
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#createShuffleMapStage","title":"Creating ShuffleMapStage for ShuffleDependency
                                                                                                                                                                                                createShuffleMapStage(\n  shuffleDep: ShuffleDependency[_, _, _],\n  jobId: Int): ShuffleMapStage\n

                                                                                                                                                                                                createShuffleMapStage creates a ShuffleMapStage for the given ShuffleDependency as follows:

                                                                                                                                                                                                • Stage ID is generated based on nextStageId internal counter

                                                                                                                                                                                                • RDD is taken from the given ShuffleDependency

                                                                                                                                                                                                • Number of tasks is the number of partitions of the RDD

                                                                                                                                                                                                • Parent RDDs

                                                                                                                                                                                                • MapOutputTrackerMaster

                                                                                                                                                                                                createShuffleMapStage registers the ShuffleMapStage in the stageIdToStage and shuffleIdToMapStage internal registries.

                                                                                                                                                                                                createShuffleMapStage updateJobIdStageIdMaps.

                                                                                                                                                                                                createShuffleMapStage requests the MapOutputTrackerMaster to check whether it contains the shuffle ID or not.

                                                                                                                                                                                                If not, createShuffleMapStage prints out the following INFO message to the logs and requests the MapOutputTrackerMaster to register the shuffle.

                                                                                                                                                                                                Registering RDD [id] ([creationSite]) as input to shuffle [shuffleId]\n
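A sketch of that registration step (names approximate; the registerShuffle signature differs across Spark versions):

if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
  // log the INFO message above, then make the shuffle known to the MapOutputTrackerMaster
  mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
}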

                                                                                                                                                                                                createShuffleMapStage is used when:

                                                                                                                                                                                                • DAGScheduler is requested to find or create a ShuffleMapStage for a given ShuffleDependency
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#cleanupStateForJobAndIndependentStages","title":"Cleaning Up After Job and Independent Stages
                                                                                                                                                                                                cleanupStateForJobAndIndependentStages(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                cleanupStateForJobAndIndependentStages cleans up the state for job and any stages that are not part of any other job.

                                                                                                                                                                                                cleanupStateForJobAndIndependentStages looks the job up in the internal jobIdToStageIds registry.

                                                                                                                                                                                                If no stages are found, the following ERROR is printed out to the logs:

                                                                                                                                                                                                No stages registered for job [jobId]\n

Otherwise, cleanupStateForJobAndIndependentStages uses the stageIdToStage registry to find the stages (the real objects, not ids!).

                                                                                                                                                                                                For each stage, cleanupStateForJobAndIndependentStages reads the jobs the stage belongs to.

                                                                                                                                                                                                If the job does not belong to the jobs of the stage, the following ERROR is printed out to the logs:

                                                                                                                                                                                                Job [jobId] not registered for stage [stageId] even though that stage was registered for the job\n

                                                                                                                                                                                                If the job was the only job for the stage, the stage (and the stage id) gets cleaned up from the registries, i.e. runningStages, shuffleIdToMapStage, waitingStages, failedStages and stageIdToStage.

                                                                                                                                                                                                While removing from runningStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                Removing running stage [stageId]\n

                                                                                                                                                                                                While removing from waitingStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                Removing stage [stageId] from waiting set.\n

                                                                                                                                                                                                While removing from failedStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                Removing stage [stageId] from failed set.\n

                                                                                                                                                                                                After all cleaning (using stageIdToStage as the source registry), if the stage belonged to the one and only job, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                After removal of stage [stageId], remaining stages = [stageIdToStage.size]\n

                                                                                                                                                                                                The job is removed from jobIdToStageIds, jobIdToActiveJob, activeJobs registries.

                                                                                                                                                                                                The final stage of the job is removed, i.e. ResultStage or ShuffleMapStage.
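A simplified sketch of the "only job for the stage" branch described above (remainingJobsForStage is a hypothetical name for the stage's jobs minus this one):

if (remainingJobsForStage.isEmpty) {
  runningStages -= stage
  waitingStages -= stage
  failedStages -= stage
  // for a ShuffleMapStage, its shuffleIdToMapStage entry (keyed by shuffle id) is dropped as well
  stageIdToStage -= stage.id
}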

cleanupStateForJobAndIndependentStages is used in handleTaskCompletion (when a ResultTask has completed successfully), failJobAndIndependentStages and markMapStageJobAsFinished.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#markMapStageJobAsFinished","title":"Marking ShuffleMapStage Job Finished
                                                                                                                                                                                                markMapStageJobAsFinished(\n  job: ActiveJob,\n  stats: MapOutputStatistics): Unit\n

                                                                                                                                                                                                markMapStageJobAsFinished marks the given ActiveJob finished and posts a SparkListenerJobEnd.

                                                                                                                                                                                                markMapStageJobAsFinished requests the given ActiveJob to turn on (true) the 0th bit in the finished partitions registry and increase the number of tasks finished.

                                                                                                                                                                                                markMapStageJobAsFinished requests the given ActiveJob for the JobListener that is requested to taskSucceeded (with the 0th index and the given MapOutputStatistics).

                                                                                                                                                                                                markMapStageJobAsFinished cleanupStateForJobAndIndependentStages.

                                                                                                                                                                                                In the end, markMapStageJobAsFinished requests the LiveListenerBus to post a SparkListenerJobEnd.
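A sketch of those steps (field names as in ActiveJob; approximate):

job.finished(0) = true                        // the single (0th) partition of a map-stage job
job.numFinished += 1
job.listener.taskSucceeded(0, stats)          // hand the MapOutputStatistics to the JobListener
cleanupStateForJobAndIndependentStages(job)
listenerBus.post(
  SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))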

                                                                                                                                                                                                markMapStageJobAsFinished is used when:

                                                                                                                                                                                                • DAGScheduler is requested to handleMapStageSubmitted and markMapStageJobsAsFinished
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getOrCreateParentStages","title":"Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies) of RDD
                                                                                                                                                                                                getOrCreateParentStages(\n  rdd: RDD[_],\n  firstJobId: Int): List[Stage]\n

                                                                                                                                                                                                getOrCreateParentStages finds all direct parent ShuffleDependencies of the input rdd and then finds ShuffleMapStages for each ShuffleDependency.

                                                                                                                                                                                                getOrCreateParentStages is used when DAGScheduler is requested to create a ShuffleMapStage or a ResultStage.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#markStageAsFinished","title":"Marking Stage Finished
                                                                                                                                                                                                markStageAsFinished(\n  stage: Stage,\n  errorMessage: Option[String] = None,\n  willRetry: Boolean = false): Unit\n

                                                                                                                                                                                                markStageAsFinished...FIXME

                                                                                                                                                                                                markStageAsFinished is used when...FIXME

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getOrCreateShuffleMapStage","title":"Looking Up ShuffleMapStage for ShuffleDependency
                                                                                                                                                                                                getOrCreateShuffleMapStage(\n  shuffleDep: ShuffleDependency[_, _, _],\n  firstJobId: Int): ShuffleMapStage\n

                                                                                                                                                                                                getOrCreateShuffleMapStage finds a ShuffleMapStage by the shuffleId of the given ShuffleDependency in the shuffleIdToMapStage internal registry and returns it if available.

                                                                                                                                                                                                If not found, getOrCreateShuffleMapStage finds all the missing ancestor shuffle dependencies and creates the missing ShuffleMapStage stages (including one for the input ShuffleDependency).
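A sketch of the lookup-or-create flow (approximate):

shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
  case Some(stage) =>
    stage
  case None =>
    // create stages for any ancestor shuffle dependencies not registered yet...
    getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
      if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
        createShuffleMapStage(dep, firstJobId)
      }
    }
    // ...and finally one for the given ShuffleDependency itself
    createShuffleMapStage(shuffleDep, firstJobId)
}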

                                                                                                                                                                                                getOrCreateShuffleMapStage is used when:

• DAGScheduler is requested to find or create missing direct parent ShuffleMapStages of an RDD, find missing parent ShuffleMapStages for a stage, handle a MapStageSubmitted event, and check whether a stage depends on another stage
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getMissingAncestorShuffleDependencies","title":"Missing ShuffleDependencies of RDD","text":"
                                                                                                                                                                                                getMissingAncestorShuffleDependencies(\n   rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]]\n

                                                                                                                                                                                                getMissingAncestorShuffleDependencies finds all the missing ShuffleDependencies for the given RDD (traversing its RDD lineage).

                                                                                                                                                                                                Note

                                                                                                                                                                                                A ShuffleDependency (of an RDD) is considered missing when not registered in the shuffleIdToMapStage internal registry.

Internally, getMissingAncestorShuffleDependencies finds the direct parent shuffle dependencies of the input RDD and collects the ones that are not registered in the shuffleIdToMapStage internal registry. It repeats the process for the RDDs of the parent shuffle dependencies.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#getShuffleDependencies","title":"Finding Direct Parent Shuffle Dependencies of RDD
getShuffleDependencies(\n  rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]]\n

                                                                                                                                                                                                getShuffleDependencies finds direct parent shuffle dependencies for the given RDD.

Internally, getShuffleDependencies takes the direct shuffle dependencies of the input RDD and the direct shuffle dependencies of all the parent non-ShuffleDependencies in the RDD lineage.
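
A minimal sketch of that traversal, assuming simplified stand-in types (RddNode, ShuffleDep, NarrowDep are hypothetical, not Spark's classes): the walk stops at every shuffle dependency it meets and keeps going up only through narrow dependencies.

import scala.collection.mutable

// Hypothetical stand-ins: an RDD node whose dependencies are either shuffle or narrow.
sealed trait Dep { def rdd: RddNode }
final case class ShuffleDep(rdd: RddNode) extends Dep
final case class NarrowDep(rdd: RddNode) extends Dep
final case class RddNode(id: Int, dependencies: Seq[Dep])

// Direct parent shuffle dependencies: stop at every ShuffleDep, but keep
// walking up through NarrowDeps (those stay within the same stage).
def getShuffleDependencies(rdd: RddNode): Set[ShuffleDep] = {
  val parents = mutable.HashSet.empty[ShuffleDep]
  val visited = mutable.HashSet.empty[Int]
  val waitingForVisit = mutable.Stack(rdd)
  while (waitingForVisit.nonEmpty) {
    val current = waitingForVisit.pop()
    if (visited.add(current.id)) {
      current.dependencies.foreach {
        case s: ShuffleDep => parents += s                // stage boundary: record and stop
        case n: NarrowDep  => waitingForVisit.push(n.rdd) // pipelined: keep traversing
      }
    }
  }
  parents.toSet
}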

getShuffleDependencies is used when DAGScheduler is requested to find or create missing direct parent ShuffleMapStages (for ShuffleDependencies of an RDD) and find all missing shuffle dependencies for a given RDD.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#failJobAndIndependentStages","title":"Failing Job and Independent Single-Job Stages
                                                                                                                                                                                                failJobAndIndependentStages(\n  job: ActiveJob,\n  failureReason: String,\n  exception: Option[Throwable] = None): Unit\n

                                                                                                                                                                                                failJobAndIndependentStages fails the input job and all the stages that are only used by the job.

                                                                                                                                                                                                Internally, failJobAndIndependentStages uses jobIdToStageIds internal registry to look up the stages registered for the job.

                                                                                                                                                                                                If no stages could be found, you should see the following ERROR message in the logs:

                                                                                                                                                                                                No stages registered for job [id]\n

                                                                                                                                                                                                Otherwise, for every stage, failJobAndIndependentStages finds the job ids the stage belongs to.

If a stage could not be found or the job is not registered for the stage, you should see the following ERROR message in the logs:

                                                                                                                                                                                                Job [id] not registered for stage [id] even though that stage was registered for the job\n

Only when there is exactly one job registered for the stage and the stage is in RUNNING state (in the runningStages internal registry) does failJobAndIndependentStages request the TaskScheduler to cancel the stage's tasks and mark the stage finished.

                                                                                                                                                                                                NOTE: failJobAndIndependentStages uses jobIdToStageIds, stageIdToStage, and runningStages internal registries.
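
A simplified sketch of the per-stage cancellation logic described above (failing the job itself and listener notifications are elided; StageRef and the callback parameters are hypothetical stand-ins for the registries, the TaskScheduler and markStageAsFinished):

// Hypothetical stand-in for a stage's bookkeeping: the jobs it belongs to and whether it is running.
final case class StageRef(jobIds: Set[Int], running: Boolean)

def failJobAndIndependentStages(
    jobId: Int,
    jobIdToStageIds: Map[Int, Set[Int]],
    stageIdToStage: Map[Int, StageRef],
    cancelStageTasks: Int => Unit,       // assumed stand-in for the TaskScheduler callback
    markStageAsFinished: Int => Unit): Unit = {
  val stageIds = jobIdToStageIds.getOrElse(jobId, Set.empty)
  if (stageIds.isEmpty) {
    println(s"ERROR No stages registered for job $jobId")
  } else {
    stageIds.foreach { stageId =>
      stageIdToStage.get(stageId) match {
        case Some(stage) if stage.jobIds.contains(jobId) =>
          // cancel only stages that belong to this job alone and are currently running
          if (stage.jobIds == Set(jobId) && stage.running) {
            cancelStageTasks(stageId)
            markStageAsFinished(stageId)
          }
        case _ =>
          println(s"ERROR Job $jobId not registered for stage $stageId " +
            "even though that stage was registered for the job")
      }
    }
  }
}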

                                                                                                                                                                                                failJobAndIndependentStages is used when...FIXME

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#abortStage","title":"Aborting Stage
                                                                                                                                                                                                abortStage(\n  failedStage: Stage,\n  reason: String,\n  exception: Option[Throwable]): Unit\n

                                                                                                                                                                                                abortStage is an internal method that finds all the active jobs that depend on the failedStage stage and fails them.

Internally, abortStage looks the failedStage stage up in the internal stageIdToStage registry and exits if the stage was not registered earlier.

                                                                                                                                                                                                If it was, abortStage finds all the active jobs (in the internal activeJobs registry) with the final stage depending on the failedStage stage.

                                                                                                                                                                                                At this time, the completionTime property (of the failed stage's StageInfo) is assigned to the current time (millis).

                                                                                                                                                                                                All the active jobs that depend on the failed stage (as calculated above) and the stages that do not belong to other jobs (aka independent stages) are failed (with the failure reason being \"Job aborted due to stage failure: [reason]\" and the input exception).

                                                                                                                                                                                                If there are no jobs depending on the failed stage, you should see the following INFO message in the logs:

                                                                                                                                                                                                Ignoring failure of [failedStage] because all jobs depending on it are done\n

abortStage is used when DAGScheduler is requested to handle a TaskSetFailed event, submit a stage, submit missing tasks of a stage, and handle a TaskCompletion event.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#stageDependsOn","title":"Checking Out Stage Dependency on Given Stage
                                                                                                                                                                                                stageDependsOn(\n  stage: Stage,\n  target: Stage): Boolean\n

                                                                                                                                                                                                stageDependsOn compares two stages and returns whether the stage depends on target stage (i.e. true) or not (i.e. false).

                                                                                                                                                                                                NOTE: A stage A depends on stage B if B is among the ancestors of A.

Internally, stageDependsOn walks through the graph of RDDs of the input stage. For every dependency of a visited RDD (using RDD.dependencies), stageDependsOn pushes the RDD of a NarrowDependency onto the stack of RDDs to visit, while for a ShuffleDependency it finds the ShuffleMapStage (for the dependency and the stage's first job id) and pushes the map stage's RDD onto the stack only when the map stage is not ready yet, i.e. not all of its partitions have shuffle outputs.

                                                                                                                                                                                                After all the RDDs of the input stage are visited, stageDependsOn checks if the target's RDD is among the RDDs of the stage, i.e. whether the stage depends on target stage.
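
The overall pattern can be sketched as follows, with hypothetical stand-in types and the ShuffleMapStage-availability check elided for brevity: walk the RDD graph of the stage, remember every RDD reached, and report a dependency when the target stage's RDD is among them.

import scala.collection.mutable

// Hypothetical stand-ins, not Spark's classes.
sealed trait Dep { def rdd: RddNode }
final case class ShuffleDep(rdd: RddNode) extends Dep
final case class NarrowDep(rdd: RddNode) extends Dep
final case class RddNode(id: Int, dependencies: Seq[Dep])
final case class StageRef(rdd: RddNode)

def stageDependsOn(stage: StageRef, target: StageRef): Boolean = {
  if (stage == target) return true
  val visitedRdds = mutable.HashSet.empty[RddNode]
  val waitingForVisit = mutable.Stack(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    val rdd = waitingForVisit.pop()
    if (visitedRdds.add(rdd)) {
      // walk up through both narrow and shuffle dependencies
      rdd.dependencies.foreach(dep => waitingForVisit.push(dep.rdd))
    }
  }
  // the stage depends on target when target's RDD was reached during the walk
  visitedRdds.contains(target.rdd)
}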

                                                                                                                                                                                                stageDependsOn is used when DAGScheduler is requested to abort a stage.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#submitWaitingChildStages","title":"Submitting Waiting Child Stages for Execution
                                                                                                                                                                                                submitWaitingChildStages(\n  parent: Stage): Unit\n

submitWaitingChildStages submits for execution all waiting stages for which the input parent stage is the direct parent.

                                                                                                                                                                                                NOTE: Waiting stages are the stages registered in waitingStages internal registry.

                                                                                                                                                                                                When executed, you should see the following TRACE messages in the logs:

                                                                                                                                                                                                Checking if any dependencies of [parent] are now runnable\nrunning: [runningStages]\nwaiting: [waitingStages]\nfailed: [failedStages]\n

                                                                                                                                                                                                submitWaitingChildStages finds child stages of the input parent stage, removes them from waitingStages internal registry, and submits one by one sorted by their job ids.
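
A minimal sketch of that step, assuming a simplified StageRef stand-in and a submitStage callback (both hypothetical):

import scala.collection.mutable

// Hypothetical stand-in for a stage: its id, earliest job id and the ids of its parent stages.
final case class StageRef(id: Int, firstJobId: Int, parents: Set[Int])

def submitWaitingChildStages(
    parent: StageRef,
    waitingStages: mutable.Set[StageRef],
    submitStage: StageRef => Unit): Unit = {
  // child stages are the waiting stages that list the parent among their parents
  val childStages = waitingStages.filter(_.parents.contains(parent.id)).toArray
  waitingStages --= childStages
  // submit one by one, sorted by the earliest-created job id
  childStages.sortBy(_.firstJobId).foreach(submitStage)
}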

submitWaitingChildStages is used when DAGScheduler is requested to submit missing tasks for a stage and handle a successful ShuffleMapTask completion.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#submitStage","title":"Submitting Stage (with Missing Parents) for Execution
                                                                                                                                                                                                submitStage(\n  stage: Stage): Unit\n

submitStage submits the given stage, or its missing parent stages first when any of them have not been computed yet.

NOTE: submitStage is also used to resubmit failed stages.

                                                                                                                                                                                                submitStage recursively submits any missing parents of the stage.

                                                                                                                                                                                                Internally, submitStage first finds the earliest-created job id that needs the stage.

                                                                                                                                                                                                NOTE: A stage itself tracks the jobs (their ids) it belongs to (using the internal jobIds registry).

                                                                                                                                                                                                The following steps depend on whether there is a job or not.

                                                                                                                                                                                                If there are no jobs that require the stage, submitStage aborts it with the reason:

                                                                                                                                                                                                No active job for stage [id]\n

                                                                                                                                                                                                If however there is a job for the stage, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                submitStage([stage])\n

                                                                                                                                                                                                submitStage checks the status of the stage and continues when it was not recorded in waiting, running or failed internal registries. It simply exits otherwise.

                                                                                                                                                                                                With the stage ready for submission, submitStage calculates the list of missing parent stages of the stage (sorted by their job ids). You should see the following DEBUG message in the logs:

                                                                                                                                                                                                missing: [missing]\n

                                                                                                                                                                                                When the stage has no parent stages missing, you should see the following INFO message in the logs:

                                                                                                                                                                                                Submitting [stage] ([stage.rdd]), which has no missing parents\n

                                                                                                                                                                                                submitStage submits the stage (with the earliest-created job id) and finishes.

                                                                                                                                                                                                If however there are missing parent stages for the stage, submitStage submits all the parent stages, and the stage is recorded in the internal waitingStages registry.
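
The recursive submission logic can be sketched as follows. Everything here is a simplified, hypothetical stand-in (StageRef, the registries and the callbacks), and the missing parents are sorted by stage id for determinism; the exact ordering and registry details are elided.

import scala.collection.mutable

// Hypothetical stand-in: firstJobId models the earliest-created active job that needs the stage.
final case class StageRef(id: Int, firstJobId: Option[Int])

def submitStage(
    stage: StageRef,
    waiting: mutable.Set[StageRef],
    running: mutable.Set[StageRef],
    failed: mutable.Set[StageRef],
    getMissingParentStages: StageRef => List[StageRef],
    submitMissingTasks: (StageRef, Int) => Unit,
    abortStage: (StageRef, String) => Unit): Unit =
  stage.firstJobId match {
    case None =>
      abortStage(stage, s"No active job for stage ${stage.id}")
    case Some(jobId) =>
      // skip stages that are already waiting, running or failed
      if (!waiting(stage) && !running(stage) && !failed(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id)
        if (missing.isEmpty) {
          submitMissingTasks(stage, jobId)   // no missing parents: submit the stage itself
        } else {
          // submit every missing parent first, then park the stage until they complete
          missing.foreach(parent => submitStage(parent, waiting, running, failed,
            getMissingParentStages, submitMissingTasks, abortStage))
          waiting += stage
        }
      }
  }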

                                                                                                                                                                                                submitStage is used recursively for missing parents of the given stage and when DAGScheduler is requested for the following:

                                                                                                                                                                                                • resubmitFailedStages (ResubmitFailedStages event)

                                                                                                                                                                                                • submitWaitingChildStages (CompletionEvent event)

                                                                                                                                                                                                • Handle JobSubmitted, MapStageSubmitted and TaskCompletion events

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#stage-attempts","title":"Stage Attempts

                                                                                                                                                                                                A single stage can be re-executed in multiple attempts due to fault recovery. The number of attempts is configured (FIXME).

If TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks.

                                                                                                                                                                                                Please note that tasks from the old attempts of a stage could still be running.

                                                                                                                                                                                                A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.

                                                                                                                                                                                                The latest StageInfo for the most recent attempt for a stage is accessible through latestInfo.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#preferred-locations","title":"Preferred Locations

DAGScheduler computes where to run each task in a stage based on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#adaptive-query-planning","title":"Adaptive Query Planning / Adaptive Scheduling

                                                                                                                                                                                                See SPARK-9850 Adaptive execution in Spark for the design document. The work is currently in progress.

                                                                                                                                                                                                DAGScheduler.submitMapStage method is used for adaptive query planning, to run map stages and look at statistics about their outputs before submitting downstream stages.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#scheduledexecutorservice-daemon-services","title":"ScheduledExecutorService daemon services

                                                                                                                                                                                                DAGScheduler uses the following ScheduledThreadPoolExecutors (with the policy of removing cancelled tasks from a work queue at time of cancellation):

• dag-scheduler-message - a daemon thread pool using j.u.c.ScheduledThreadPoolExecutor with core pool size 1. It is used to post a ResubmitFailedStages event when FetchFailed is reported.

                                                                                                                                                                                                They are created using ThreadUtils.newDaemonSingleThreadScheduledExecutor method that uses Guava DSL to instantiate a ThreadFactory.
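
For illustration, an equivalent single-threaded, daemon scheduler with the remove-on-cancel policy can be built with plain java.util.concurrent (this is not Spark's ThreadUtils code, just a sketch of the configuration described above):

import java.util.concurrent.{ScheduledThreadPoolExecutor, ThreadFactory, TimeUnit}

// A single-threaded, daemon, "remove cancelled tasks on cancellation" scheduler.
val threadFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "dag-scheduler-message")
    t.setDaemon(true)
    t
  }
}
val messageScheduler = new ScheduledThreadPoolExecutor(1, threadFactory)
messageScheduler.setRemoveOnCancelPolicy(true)

// e.g. post a ResubmitFailedStages-like action after a short delay
messageScheduler.schedule(
  new Runnable { override def run(): Unit = println("resubmit failed stages") },
  200, TimeUnit.MILLISECONDS)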

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getMissingParentStages","title":"Finding Missing Parent ShuffleMapStages For Stage
                                                                                                                                                                                                getMissingParentStages(\n  stage: Stage): List[Stage]\n

                                                                                                                                                                                                getMissingParentStages finds missing parent ShuffleMapStages in the dependency graph of the input stage (using the breadth-first search algorithm).

                                                                                                                                                                                                Internally, getMissingParentStages starts with the stage's RDD and walks up the tree of all parent RDDs to find uncached partitions.

NOTE: A Stage tracks the associated RDD using the rdd property.

NOTE: An uncached partition of an RDD is a partition that has Nil in the internal registry of partition locations per RDD (which results in no RDD blocks in any of the active BlockManagers on executors).

getMissingParentStages traverses the parent dependencies of the RDD and acts according to their type, i.e. ShuffleDependency or NarrowDependency.

                                                                                                                                                                                                NOTE: ShuffleDependency and NarrowDependency are the main top-level Dependencies.

For each NarrowDependency, getMissingParentStages simply marks the corresponding RDD to visit and moves on to the next dependency of the RDD or another unvisited parent RDD.

NOTE: NarrowDependency is an RDD dependency that allows for pipelined execution.

                                                                                                                                                                                                getMissingParentStages focuses on ShuffleDependency dependencies.

NOTE: ShuffleDependency is an RDD dependency that represents a dependency on the output of a ShuffleMapStage, i.e. a shuffle map stage.

                                                                                                                                                                                                For each ShuffleDependency, getMissingParentStages finds ShuffleMapStage stages. If the ShuffleMapStage is not available, it is added to the set of missing (map) stages.

                                                                                                                                                                                                NOTE: A ShuffleMapStage is available when all its partitions are computed, i.e. results are available (as blocks).
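
A compact sketch of that traversal, using hypothetical stand-in types (RddNode, ShuffleDep, NarrowDep, MapStage) and a mapStageFor callback standing in for getOrCreateShuffleMapStage:

import scala.collection.mutable

// Hypothetical stand-ins; "available" models a ShuffleMapStage with all outputs computed.
sealed trait Dep { def rdd: RddNode }
final case class ShuffleDep(rdd: RddNode, shuffleId: Int) extends Dep
final case class NarrowDep(rdd: RddNode) extends Dep
final case class RddNode(id: Int, dependencies: Seq[Dep], cached: Boolean)
final case class MapStage(shuffleId: Int, available: Boolean)

def getMissingParentStages(
    stageRdd: RddNode,
    mapStageFor: ShuffleDep => MapStage): List[MapStage] = {
  val missing = mutable.HashSet.empty[MapStage]
  val visited = mutable.HashSet.empty[Int]
  val waitingForVisit = mutable.Stack(stageRdd)
  while (waitingForVisit.nonEmpty) {
    val rdd = waitingForVisit.pop()
    if (visited.add(rdd.id) && !rdd.cached) {   // cached partitions need no recomputation
      rdd.dependencies.foreach {
        case s: ShuffleDep =>
          val mapStage = mapStageFor(s)
          if (!mapStage.available) missing += mapStage   // parent map stage still to compute
        case n: NarrowDep =>
          waitingForVisit.push(n.rdd)                    // pipelined: keep walking up
      }
    }
  }
  missing.toList
}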

                                                                                                                                                                                                CAUTION: FIXME...IMAGE with ShuffleDependencies queried

                                                                                                                                                                                                getMissingParentStages is used when DAGScheduler is requested to submit a stage and handle JobSubmitted and MapStageSubmitted events.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#submitMissingTasks","title":"Submitting Missing Tasks of Stage
                                                                                                                                                                                                submitMissingTasks(\n  stage: Stage,\n  jobId: Int): Unit\n

                                                                                                                                                                                                submitMissingTasks prints out the following DEBUG message to the logs:

                                                                                                                                                                                                submitMissingTasks([stage])\n

                                                                                                                                                                                                submitMissingTasks requests the given Stage for the missing partitions (partitions that need to be computed).

                                                                                                                                                                                                submitMissingTasks adds the stage to the runningStages internal registry.

                                                                                                                                                                                                submitMissingTasks notifies the OutputCommitCoordinator that stage execution started.

                                                                                                                                                                                                submitMissingTasks determines preferred locations (task locality preferences) of the missing partitions.

                                                                                                                                                                                                submitMissingTasks requests the stage for a new stage attempt.

                                                                                                                                                                                                submitMissingTasks requests the LiveListenerBus to post a SparkListenerStageSubmitted event.

                                                                                                                                                                                                submitMissingTasks uses the closure Serializer to serialize the stage and create a so-called task binary. submitMissingTasks serializes the RDD (of the stage) and either the ShuffleDependency or the compute function based on the type of the stage (ShuffleMapStage or ResultStage, respectively).

                                                                                                                                                                                                submitMissingTasks creates a broadcast variable for the task binary.

                                                                                                                                                                                                Note

                                                                                                                                                                                                That shows how important broadcast variables are for Spark itself to distribute data among executors in a Spark application in the most efficient way.
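
For reference, this is the user-facing broadcast API; submitMissingTasks relies on the same mechanism for the serialized task binary. A small, self-contained example (local master and sample data are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Broadcast a read-only lookup table and use it inside tasks.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("broadcast-demo"))

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // shipped to executors once per executor

val total = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))            // tasks read the broadcast value
  .sum()

println(total)  // 4.0
sc.stop()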

                                                                                                                                                                                                submitMissingTasks creates tasks for every missing partition:

                                                                                                                                                                                                • ShuffleMapTasks for a ShuffleMapStage

                                                                                                                                                                                                • ResultTasks for a ResultStage

                                                                                                                                                                                                If there are tasks to submit for execution (i.e. there are missing partitions in the stage), submitMissingTasks prints out the following INFO message to the logs:

                                                                                                                                                                                                Submitting [size] missing tasks from [stage] ([rdd]) (first 15 tasks are for partitions [partitionIds])\n

submitMissingTasks requests the TaskScheduler to submit the tasks for execution (as a new TaskSet).

                                                                                                                                                                                                With no tasks to submit for execution, submitMissingTasks marks the stage as finished successfully.

                                                                                                                                                                                                submitMissingTasks prints out the following DEBUG messages based on the type of the stage:

                                                                                                                                                                                                Stage [stage] is actually done; (available: [isAvailable],available outputs: [numAvailableOutputs],partitions: [numPartitions])\n

                                                                                                                                                                                                or

                                                                                                                                                                                                Stage [stage] is actually done; (partitions: [numPartitions])\n

                                                                                                                                                                                                for ShuffleMapStage and ResultStage, respectively.

                                                                                                                                                                                                In the end, with no tasks to submit for execution, submitMissingTasks submits waiting child stages for execution and exits.
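
The per-partition task creation and the submit-or-finish decision can be sketched as follows, using heavily simplified, hypothetical stand-ins for stages, tasks, the TaskScheduler hand-off and the stage-finished bookkeeping:

// Hypothetical, heavily simplified model of the per-partition task creation step.
sealed trait StageLike { def missingPartitions: Seq[Int] }
final case class ShuffleMapStageLike(missingPartitions: Seq[Int]) extends StageLike
final case class ResultStageLike(missingPartitions: Seq[Int]) extends StageLike

sealed trait TaskLike
final case class ShuffleMapTaskLike(partitionId: Int) extends TaskLike
final case class ResultTaskLike(partitionId: Int) extends TaskLike

// One task per missing partition; the task type follows the stage type.
def tasksFor(stage: StageLike): Seq[TaskLike] = stage match {
  case s: ShuffleMapStageLike => s.missingPartitions.map(ShuffleMapTaskLike.apply)
  case r: ResultStageLike     => r.missingPartitions.map(ResultTaskLike.apply)
}

// With tasks to run, they are handed over as one TaskSet-like batch;
// with none, the stage is marked finished (and waiting child stages follow).
def submit(
    stage: StageLike,
    submitTaskSet: Seq[TaskLike] => Unit,
    markFinished: () => Unit): Unit = {
  val tasks = tasksFor(stage)
  if (tasks.nonEmpty) submitTaskSet(tasks) else markFinished()
}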

                                                                                                                                                                                                submitMissingTasks is used when DAGScheduler is requested to submit a stage for execution.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getPreferredLocs","title":"Finding Preferred Locations for Missing Partitions
getPreferredLocs(\n  rdd: RDD[_],\n  partition: Int): Seq[TaskLocation]\n

                                                                                                                                                                                                getPreferredLocs is simply an alias for the internal (recursive) getPreferredLocsInternal.

                                                                                                                                                                                                getPreferredLocs is used when...FIXME

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getCacheLocs","title":"Finding BlockManagers (Executors) for Cached RDD Partitions (aka Block Location Discovery)
getCacheLocs(\n  rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]]\n

                                                                                                                                                                                                getCacheLocs gives TaskLocations (block locations) for the partitions of the input rdd. getCacheLocs caches lookup results in cacheLocs internal registry.

NOTE: The size of the collection from getCacheLocs is exactly the number of partitions of the input rdd.

NOTE: The size of every TaskLocation collection (i.e. every entry in the result of getCacheLocs) is exactly the number of blocks managed using BlockManagers on executors.

                                                                                                                                                                                                Internally, getCacheLocs finds rdd in the cacheLocs internal registry (of partition locations per RDD).

If rdd is not in the cacheLocs internal registry, getCacheLocs branches off per its storage level.

For the NONE storage level (i.e. no caching), the result is an empty collection of locations (i.e. no location preference).

For other non-NONE storage levels, getCacheLocs requests the BlockManagerMaster for the block locations that are then mapped to TaskLocations with the hostname of the owning BlockManager for a block (of a partition) and the executor id.

                                                                                                                                                                                                getCacheLocs records the computed block locations per partition (as TaskLocation) in cacheLocs internal registry.

NOTE: getCacheLocs requests locations from BlockManagerMaster using RDDBlockIds with the RDD id and the partition indices (which implies that the order of the partitions matters to request proper blocks).

NOTE: DAGScheduler uses TaskLocations (with host and executor) while BlockManagerMaster uses BlockManagerIds (to track similar information, i.e. block locations).
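
A minimal sketch of the memoized lookup, with hypothetical stand-ins (RddRef, BlockLocation, and a blockLocations callback standing in for the BlockManagerMaster query):

import scala.collection.mutable

// Hypothetical stand-ins; BlockLocation plays the role of a TaskLocation,
// and "cached" models a storage level other than NONE.
final case class RddRef(id: Int, numPartitions: Int, cached: Boolean)
final case class BlockLocation(host: String, executorId: String)

// Plays the role of the cacheLocs internal registry (per-RDD, per-partition locations).
val cacheLocs = mutable.Map.empty[Int, IndexedSeq[Seq[BlockLocation]]]

def getCacheLocs(
    rdd: RddRef,
    // assumed callback standing in for the BlockManagerMaster lookup by RDDBlockId
    blockLocations: (Int, Int) => Seq[BlockLocation]): IndexedSeq[Seq[BlockLocation]] =
  cacheLocs.getOrElseUpdate(rdd.id, {
    if (!rdd.cached) {
      // NONE storage level: no location preference for any partition
      IndexedSeq.fill(rdd.numPartitions)(Seq.empty[BlockLocation])
    } else {
      // one entry per partition, in partition order, as returned by the block manager
      (0 until rdd.numPartitions).map(p => blockLocations(rdd.id, p))
    }
  })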

                                                                                                                                                                                                getCacheLocs is used when DAGScheduler is requested to find missing parent MapStages and getPreferredLocsInternal.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getPreferredLocsInternal","title":"Finding Placement Preferences for RDD Partition (recursively)
getPreferredLocsInternal(\n  rdd: RDD[_],\n  partition: Int,\n  visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation]\n

                                                                                                                                                                                                getPreferredLocsInternal first finds the TaskLocations for the partition of the rdd (using cacheLocs internal cache) and returns them.

Otherwise, if not found, getPreferredLocsInternal requests the rdd for the preferred locations of the partition and returns them.

NOTE: Preferred locations of the partitions of an RDD are also called placement preferences or locality preferences.

                                                                                                                                                                                                Otherwise, if not found, getPreferredLocsInternal finds the first parent NarrowDependency and (recursively) finds TaskLocations.

If all the attempts fail to yield any non-empty result, getPreferredLocsInternal returns an empty collection of TaskLocations.
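
The three-step resolution order can be sketched as follows (hypothetical stand-ins; the partition mapping across narrow dependencies is simplified to the same partition index):

// Hypothetical stand-ins; the three lookups mirror the order described above.
final case class Loc(host: String)
final case class RddNode(
    id: Int,
    preferredLocations: Int => Seq[Loc],   // stands in for rdd.preferredLocations(partition)
    narrowParents: Seq[RddNode])

def getPreferredLocsInternal(
    rdd: RddNode,
    partition: Int,
    cachedLocs: (Int, Int) => Seq[Loc],
    visited: Set[(Int, Int)] = Set.empty): Seq[Loc] = {
  if (visited.contains((rdd.id, partition))) return Nil   // already tried this (rdd, partition)
  val nowVisited = visited + ((rdd.id, partition))

  val cached = cachedLocs(rdd.id, partition)              // 1. block locations of cached partitions
  if (cached.nonEmpty) return cached

  val preferred = rdd.preferredLocations(partition)       // 2. the RDD's own placement preferences
  if (preferred.nonEmpty) return preferred

  // 3. recurse into narrow parents and take the first non-empty answer
  rdd.narrowParents.iterator
    .map(parent => getPreferredLocsInternal(parent, partition, cachedLocs, nowVisited))
    .find(_.nonEmpty)
    .getOrElse(Nil)
}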

                                                                                                                                                                                                getPreferredLocsInternal is used when DAGScheduler is requested for the preferred locations for missing partitions.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#stop","title":"Stopping DAGScheduler
                                                                                                                                                                                                stop(): Unit\n

                                                                                                                                                                                                stop stops the internal dag-scheduler-message thread pool, dag-scheduler-event-loop, and TaskScheduler.

                                                                                                                                                                                                stop is used when SparkContext is requested to stop.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#killTaskAttempt","title":"Killing Task
                                                                                                                                                                                                killTaskAttempt(\n  taskId: Long,\n  interruptThread: Boolean,\n  reason: String): Boolean\n

                                                                                                                                                                                                killTaskAttempt requests the TaskScheduler to kill a task.

                                                                                                                                                                                                killTaskAttempt is used when SparkContext is requested to kill a task.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#cleanUpAfterSchedulerStop","title":"cleanUpAfterSchedulerStop
                                                                                                                                                                                                cleanUpAfterSchedulerStop(): Unit\n

                                                                                                                                                                                                cleanUpAfterSchedulerStop...FIXME

                                                                                                                                                                                                cleanUpAfterSchedulerStop is used when DAGSchedulerEventProcessLoop is requested to onStop.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#removeExecutorAndUnregisterOutputs","title":"removeExecutorAndUnregisterOutputs
                                                                                                                                                                                                removeExecutorAndUnregisterOutputs(\n  execId: String,\n  fileLost: Boolean,\n  hostToUnregisterOutputs: Option[String],\n  maybeEpoch: Option[Long] = None): Unit\n

                                                                                                                                                                                                removeExecutorAndUnregisterOutputs...FIXME

                                                                                                                                                                                                removeExecutorAndUnregisterOutputs is used when DAGScheduler is requested to handle task completion (due to a fetch failure) and executor lost events.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#markMapStageJobsAsFinished","title":"markMapStageJobsAsFinished
                                                                                                                                                                                                markMapStageJobsAsFinished(\n  shuffleStage: ShuffleMapStage): Unit\n

markMapStageJobsAsFinished checks whether the given ShuffleMapStage is fully available while there are still map-stage jobs running.

                                                                                                                                                                                                If so, markMapStageJobsAsFinished requests the MapOutputTrackerMaster for the statistics (for the ShuffleDependency of the given ShuffleMapStage).

                                                                                                                                                                                                For every map-stage job, markMapStageJobsAsFinished marks the map-stage job as finished (with the statistics).
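
A short sketch of that check-and-finish flow, with hypothetical stand-ins for the stage, the statistics lookup and the job-finishing callback:

// Hypothetical stand-ins for the check-and-finish flow described above.
final case class MapStageLike(isAvailable: Boolean, mapStageJobs: Seq[Int], shuffleId: Int)
final case class Stats(bytesByPartition: Array[Long])

def markMapStageJobsAsFinished(
    stage: MapStageLike,
    statisticsFor: Int => Stats,                 // assumed MapOutputTrackerMaster-style lookup
    markJobFinished: (Int, Stats) => Unit): Unit =
  if (stage.isAvailable && stage.mapStageJobs.nonEmpty) {
    val stats = statisticsFor(stage.shuffleId)   // statistics for the shuffle dependency
    stage.mapStageJobs.foreach(jobId => markJobFinished(jobId, stats))
  }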

                                                                                                                                                                                                markMapStageJobsAsFinished is used when:

                                                                                                                                                                                                • DAGScheduler is requested to submit missing tasks (of a ShuffleMapStage that has just been computed) and processShuffleMapStageCompletion
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#processShuffleMapStageCompletion","title":"processShuffleMapStageCompletion
                                                                                                                                                                                                processShuffleMapStageCompletion(\n  shuffleStage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                processShuffleMapStageCompletion...FIXME

                                                                                                                                                                                                processShuffleMapStageCompletion is used when:

                                                                                                                                                                                                • DAGScheduler is requested to handleTaskCompletion and handleShuffleMergeFinalized
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#handleShuffleMergeFinalized","title":"handleShuffleMergeFinalized
                                                                                                                                                                                                handleShuffleMergeFinalized(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                handleShuffleMergeFinalized...FIXME

                                                                                                                                                                                                handleShuffleMergeFinalized is used when:

                                                                                                                                                                                                • DAGSchedulerEventProcessLoop is requested to handle a ShuffleMergeFinalized event
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#scheduleShuffleMergeFinalize","title":"scheduleShuffleMergeFinalize
                                                                                                                                                                                                scheduleShuffleMergeFinalize(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                scheduleShuffleMergeFinalize...FIXME

                                                                                                                                                                                                scheduleShuffleMergeFinalize is used when:

                                                                                                                                                                                                • DAGScheduler is requested to handle a task completion
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#finalizeShuffleMerge","title":"finalizeShuffleMerge","text":"
                                                                                                                                                                                                finalizeShuffleMerge(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                finalizeShuffleMerge...FIXME

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#updateJobIdStageIdMaps","title":"updateJobIdStageIdMaps
                                                                                                                                                                                                updateJobIdStageIdMaps(\n  jobId: Int,\n  stage: Stage): Unit\n

                                                                                                                                                                                                updateJobIdStageIdMaps...FIXME

                                                                                                                                                                                                updateJobIdStageIdMaps is used when DAGScheduler is requested to create ShuffleMapStage and ResultStage stages.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#executorHeartbeatReceived","title":"executorHeartbeatReceived
                                                                                                                                                                                                executorHeartbeatReceived(\n  execId: String,\n  // (taskId, stageId, stageAttemptId, accumUpdates)\n  accumUpdates: Array[(Long, Int, Int, Seq[AccumulableInfo])],\n  blockManagerId: BlockManagerId,\n  // (stageId, stageAttemptId) -> metrics\n  executorUpdates: mutable.Map[(Int, Int), ExecutorMetrics]): Boolean\n

                                                                                                                                                                                                executorHeartbeatReceived posts a SparkListenerExecutorMetricsUpdate (to listenerBus) and informs BlockManagerMaster that blockManagerId block manager is alive (by posting BlockManagerHeartbeat).

                                                                                                                                                                                                executorHeartbeatReceived is used when TaskSchedulerImpl is requested to handle an executor heartbeat.
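
A minimal sketch of the two side effects, with hypothetical ListenerBus and BlockManagerMaster traits standing in for Spark's real classes (the signature is also simplified):

object HeartbeatSketch {\n  case class BlockManagerId(executorId: String, host: String, port: Int) // simplified\n  case class SparkListenerExecutorMetricsUpdate(execId: String, accumUpdates: Seq[(Long, Int, Int, Seq[String])]) // simplified event\n  case class BlockManagerHeartbeat(blockManagerId: BlockManagerId)\n\n  trait ListenerBus { def post(event: Any): Unit }\n  trait BlockManagerMaster { def askHeartbeat(msg: BlockManagerHeartbeat): Boolean } // true if the block manager is registered\n\n  // Returns true when the block manager is still known to the master\n  def executorHeartbeatReceived(\n      execId: String,\n      accumUpdates: Seq[(Long, Int, Int, Seq[String])],\n      blockManagerId: BlockManagerId,\n      bus: ListenerBus,\n      master: BlockManagerMaster): Boolean = {\n    bus.post(SparkListenerExecutorMetricsUpdate(execId, accumUpdates)) // 1. announce the metrics update\n    master.askHeartbeat(BlockManagerHeartbeat(blockManagerId))         // 2. tell BlockManagerMaster the block manager is alive\n  }\n}\n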

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#event-handlers","title":"Event Handlers","text":""},{"location":"scheduler/DAGScheduler/#doCancelAllJobs","title":"AllJobsCancelled Event Handler","text":"
                                                                                                                                                                                                doCancelAllJobs(): Unit\n

                                                                                                                                                                                                doCancelAllJobs...FIXME

                                                                                                                                                                                                doCancelAllJobs is used when DAGSchedulerEventProcessLoop is requested to handle an AllJobsCancelled event and onError.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleBeginEvent","title":"BeginEvent Event Handler","text":"
                                                                                                                                                                                                handleBeginEvent(\n  task: Task[_],\n  taskInfo: TaskInfo): Unit\n

                                                                                                                                                                                                handleBeginEvent...FIXME

                                                                                                                                                                                                handleBeginEvent is used when DAGSchedulerEventProcessLoop is requested to handle a BeginEvent event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion","title":"Handling Task Completion Event","text":"
                                                                                                                                                                                                handleTaskCompletion(\n  event: CompletionEvent): Unit\n

                                                                                                                                                                                                handleTaskCompletion handles a CompletionEvent.

                                                                                                                                                                                                handleTaskCompletion notifies the OutputCommitCoordinator that a task completed.

handleTaskCompletion finds the stage in the stageIdToStage registry. If the stage is not found, handleTaskCompletion merely posts a task-end event (postTaskEnd) and quits.

handleTaskCompletion updates accumulators (updateAccumulators).

                                                                                                                                                                                                handleTaskCompletion announces task completion application-wide.

                                                                                                                                                                                                handleTaskCompletion branches off per TaskEndReason (as event.reason).

TaskEndReason | Description
Success | Acts according to the type of the task that completed, i.e. ShuffleMapTask and ResultTask
Resubmitted |
others |
"},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success","title":"Handling Successful Task Completion","text":"

                                                                                                                                                                                                When a task has finished successfully (i.e. Success end reason), handleTaskCompletion marks the partition as no longer pending (i.e. the partition the task worked on is removed from pendingPartitions of the stage).

                                                                                                                                                                                                NOTE: A Stage tracks its own pending partitions using scheduler:Stage.md#pendingPartitions[pendingPartitions property].

                                                                                                                                                                                                handleTaskCompletion branches off given the type of the task that completed, i.e. ShuffleMapTask and ResultTask.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success-ResultTask","title":"Handling Successful ResultTask Completion","text":"

For scheduler:ResultTask.md[ResultTask], the stage is assumed to be a scheduler:ResultStage.md[ResultStage].

                                                                                                                                                                                                handleTaskCompletion finds the ActiveJob associated with the ResultStage.

                                                                                                                                                                                                NOTE: scheduler:ResultStage.md[ResultStage] tracks the optional ActiveJob as scheduler:ResultStage.md#activeJob[activeJob property]. There could only be one active job for a ResultStage.

                                                                                                                                                                                                If there is no job for the ResultStage, you should see the following INFO message in the logs:

                                                                                                                                                                                                Ignoring result from [task] because its job has finished\n

Otherwise, when the ResultStage has an ActiveJob, handleTaskCompletion checks the status of the partition output for the partition the ResultTask ran for.

                                                                                                                                                                                                NOTE: ActiveJob tracks task completions in finished property with flags for every partition in a stage. When the flag for a partition is enabled (i.e. true), it is assumed that the partition has been computed (and no results from any ResultTask are expected and hence simply ignored).

CAUTION: FIXME Describe why a partition could have more than one ResultTask running.

                                                                                                                                                                                                handleTaskCompletion ignores the CompletionEvent when the partition has already been marked as completed for the stage and simply exits.

                                                                                                                                                                                                handleTaskCompletion scheduler:DAGScheduler.md#updateAccumulators[updates accumulators].

The partition for the ActiveJob (of the ResultStage) is marked as computed and the number of computed partitions is increased.

                                                                                                                                                                                                NOTE: ActiveJob tracks what partitions have already been computed and their number.

                                                                                                                                                                                                If the ActiveJob has finished (when the number of partitions computed is exactly the number of partitions in a stage) handleTaskCompletion does the following (in order):

                                                                                                                                                                                                1. scheduler:DAGScheduler.md#markStageAsFinished[Marks ResultStage computed].
                                                                                                                                                                                                2. scheduler:DAGScheduler.md#cleanupStateForJobAndIndependentStages[Cleans up after ActiveJob and independent stages].
                                                                                                                                                                                                3. Announces the job completion application-wide (by posting a SparkListener.md#SparkListenerJobEnd[SparkListenerJobEnd] to scheduler:LiveListenerBus.md[]).

                                                                                                                                                                                                In the end, handleTaskCompletion notifies JobListener of the ActiveJob that the task succeeded.

                                                                                                                                                                                                NOTE: A task succeeded notification holds the output index and the result.

                                                                                                                                                                                                When the notification throws an exception (because it runs user code), handleTaskCompletion notifies JobListener about the failure (wrapping it inside a SparkDriverExecutionException exception).
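
A condensed sketch of the bookkeeping described above, using simplified, hypothetical ActiveJob and JobListener types (accumulator updates omitted):

object ResultTaskCompletionSketch {\n  trait JobListener {\n    def taskSucceeded(index: Int, result: Any): Unit\n    def jobFailed(e: Exception): Unit\n  }\n\n  // Simplified ActiveJob: a flag per partition plus a counter of computed partitions\n  class ActiveJob(val numPartitions: Int, val listener: JobListener) {\n    val finished: Array[Boolean] = Array.fill(numPartitions)(false)\n    var numFinished = 0\n  }\n\n  def handleSuccessfulResultTask(job: ActiveJob, outputId: Int, result: Any)(finishStageAndJob: () => Unit): Unit = {\n    if (!job.finished(outputId)) {              // ignore results for an already computed partition\n      job.finished(outputId) = true\n      job.numFinished += 1\n      if (job.numFinished == job.numPartitions) // all partitions computed:\n        finishStageAndJob()                     // mark the stage finished, clean up, post SparkListenerJobEnd\n      try job.listener.taskSucceeded(outputId, result)          // notify the listener (may run user code)\n      catch { case e: Exception => job.listener.jobFailed(e) }  // wrapped in SparkDriverExecutionException in Spark\n    }\n  }\n}\n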

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success-ShuffleMapTask","title":"Handling Successful ShuffleMapTask Completion","text":"

For scheduler:ShuffleMapTask.md[ShuffleMapTask], the stage is assumed to be a scheduler:ShuffleMapStage.md[ShuffleMapStage].

                                                                                                                                                                                                handleTaskCompletion scheduler:DAGScheduler.md#updateAccumulators[updates accumulators].

The task's result is assumed to be a scheduler:MapStatus.md[MapStatus] that knows the executor where the task finished.

                                                                                                                                                                                                You should see the following DEBUG message in the logs:

                                                                                                                                                                                                ShuffleMapTask finished on [execId]\n

                                                                                                                                                                                                If the executor is registered in scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry] and the epoch of the completed task is not greater than that of the executor (as in failedEpoch registry), you should see the following INFO message in the logs:

                                                                                                                                                                                                Ignoring possibly bogus [task] completion from executor [executorId]\n

                                                                                                                                                                                                Otherwise, handleTaskCompletion scheduler:ShuffleMapStage.md#addOutputLoc[registers the MapStatus result for the partition with the stage] (of the completed task).

                                                                                                                                                                                                handleTaskCompletion does more processing only if the ShuffleMapStage is registered as still running (in scheduler:DAGScheduler.md#runningStages[runningStages internal registry]) and the scheduler:Stage.md#pendingPartitions[ShuffleMapStage stage has no pending partitions to compute].

                                                                                                                                                                                                The ShuffleMapStage is marked as finished.

                                                                                                                                                                                                You should see the following INFO messages in the logs:

                                                                                                                                                                                                looking for newly runnable stages\nrunning: [runningStages]\nwaiting: [waitingStages]\nfailed: [failedStages]\n

                                                                                                                                                                                                handleTaskCompletion scheduler:MapOutputTrackerMaster.md#registerMapOutputs[registers the shuffle map outputs of the ShuffleDependency with MapOutputTrackerMaster] (with the epoch incremented) and scheduler:DAGScheduler.md#clearCacheLocs[clears internal cache of the stage's RDD block locations].

                                                                                                                                                                                                NOTE: scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] is given when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].

                                                                                                                                                                                                If the scheduler:ShuffleMapStage.md#isAvailable[ShuffleMapStage stage is ready], all scheduler:ShuffleMapStage.md#mapStageJobs[active jobs of the stage] (aka map-stage jobs) are scheduler:DAGScheduler.md#markMapStageJobAsFinished[marked as finished] (with scheduler:MapOutputTrackerMaster.md#getStatistics[MapOutputStatistics from MapOutputTrackerMaster for the ShuffleDependency]).

                                                                                                                                                                                                NOTE: A ShuffleMapStage stage is ready (aka available) when all partitions have shuffle outputs, i.e. when their tasks have completed.

                                                                                                                                                                                                Eventually, handleTaskCompletion scheduler:DAGScheduler.md#submitWaitingChildStages[submits waiting child stages (of the ready ShuffleMapStage)].

                                                                                                                                                                                                If however the ShuffleMapStage is not ready, you should see the following INFO message in the logs:

                                                                                                                                                                                                Resubmitting [shuffleStage] ([shuffleStage.name]) because some of its tasks had failed: [missingPartitions]\n

                                                                                                                                                                                                In the end, handleTaskCompletion scheduler:DAGScheduler.md#submitStage[submits the ShuffleMapStage for execution].
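
A simplified sketch of the decision flow for a completed ShuffleMapTask; MiniShuffleMapStage, MapStatus and the callback parameters are hypothetical stand-ins for the real types and registries:

import scala.collection.mutable\n\nobject ShuffleMapTaskCompletionSketch {\n  case class MapStatus(execId: String, partitionId: Int) // simplified: where the map output lives\n\n  // Trimmed-down ShuffleMapStage\n  class MiniShuffleMapStage(val numPartitions: Int) {\n    val pendingPartitions = mutable.Set.empty[Int]\n    val outputs = mutable.Map.empty[Int, MapStatus]\n    def isAvailable: Boolean = outputs.size == numPartitions // every partition has a shuffle output\n  }\n\n  def handleSuccessfulShuffleMapTask(\n      stage: MiniShuffleMapStage,\n      status: MapStatus,\n      isRunning: Boolean,                                    // stage still in runningStages?\n      registerMapOutputs: MiniShuffleMapStage => Unit,       // MapOutputTrackerMaster.registerMapOutputs in Spark\n      submitWaitingChildStages: MiniShuffleMapStage => Unit,\n      resubmitStage: MiniShuffleMapStage => Unit): Unit = {\n    stage.outputs(status.partitionId) = status               // register the MapStatus for the partition\n    if (isRunning && stage.pendingPartitions.isEmpty) {      // the stage has no more tasks to run\n      registerMapOutputs(stage)                              // publish the shuffle outputs (epoch incremented)\n      if (stage.isAvailable) submitWaitingChildStages(stage) // ready: child stages can be submitted\n      else resubmitStage(stage)                              // some outputs are missing: run the stage again\n    }\n  }\n}\n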

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Resubmitted","title":"TaskEndReason: Resubmitted","text":"

                                                                                                                                                                                                For Resubmitted case, you should see the following INFO message in the logs:

                                                                                                                                                                                                Resubmitted [task], so marking it as still running\n

                                                                                                                                                                                                The task (by task.partitionId) is added to the collection of pending partitions of the stage (using stage.pendingPartitions).

                                                                                                                                                                                                TIP: A stage knows how many partitions are yet to be calculated. A task knows about the partition id for which it was launched.
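
In sketch form (hypothetical MiniStage and MiniTask types), the partition simply goes back to the stage's pending partitions:

import scala.collection.mutable\n\nobject ResubmittedSketch {\n  class MiniStage { val pendingPartitions = mutable.Set.empty[Int] }\n  case class MiniTask(partitionId: Int)\n\n  // Resubmitted: the task will run again, so its partition is pending again\n  def handleResubmitted(stage: MiniStage, task: MiniTask): Unit =\n    stage.pendingPartitions += task.partitionId\n}\n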

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-FetchFailed","title":"Task Failed with FetchFailed Exception","text":"
                                                                                                                                                                                                FetchFailed(\n  bmAddress: BlockManagerId,\n  shuffleId: Int,\n  mapId: Int,\n  reduceId: Int,\n  message: String)\nextends TaskFailedReason\n

When FetchFailed happens, handleTaskCompletion uses stageIdToStage to look up the failed stage (by task.stageId; the task is available in the event given to handleTaskCompletion(event: CompletionEvent)) and shuffleToMapStage to look up the map stage (by shuffleId).

                                                                                                                                                                                                If failedStage.latestInfo.attemptId != task.stageAttemptId, you should see the following INFO in the logs:

                                                                                                                                                                                                Ignoring fetch failure from [task] as it's from [failedStage] attempt [task.stageAttemptId] and there is a more recent attempt for that stage (attempt ID [failedStage.latestInfo.attemptId]) running\n

                                                                                                                                                                                                CAUTION: FIXME What does failedStage.latestInfo.attemptId != task.stageAttemptId mean?

If so, the case finishes. Otherwise, the case continues.

                                                                                                                                                                                                If the failed stage is in runningStages, the following INFO message shows in the logs:

                                                                                                                                                                                                Marking [failedStage] ([failedStage.name]) as failed due to a fetch failure from [mapStage] ([mapStage.name])\n

                                                                                                                                                                                                markStageAsFinished(failedStage, Some(failureMessage)) is called.

                                                                                                                                                                                                CAUTION: FIXME What does markStageAsFinished do?

                                                                                                                                                                                                If the failed stage is not in runningStages, the following DEBUG message shows in the logs:

                                                                                                                                                                                                Received fetch failure from [task], but its from [failedStage] which is no longer running\n

                                                                                                                                                                                                When disallowStageRetryForTest is set, abortStage(failedStage, \"Fetch failure will not retry stage due to testing config\", None) is called.

                                                                                                                                                                                                CAUTION: FIXME Describe disallowStageRetryForTest and abortStage.

                                                                                                                                                                                                If the scheduler:Stage.md#failedOnFetchAndShouldAbort[number of fetch failed attempts for the stage exceeds the allowed number], the scheduler:DAGScheduler.md#abortStage[failed stage is aborted] with the reason:

                                                                                                                                                                                                [failedStage] ([name]) has failed the maximum allowable number of times: 4. Most recent failure reason: [failureMessage]\n

                                                                                                                                                                                                If there are no failed stages reported (scheduler:DAGScheduler.md#failedStages[DAGScheduler.failedStages] is empty), the following INFO shows in the logs:

                                                                                                                                                                                                Resubmitting [mapStage] ([mapStage.name]) and [failedStage] ([failedStage.name]) due to fetch failure\n

                                                                                                                                                                                                And the following code is executed:

                                                                                                                                                                                                messageScheduler.schedule(\n  new Runnable {\n    override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)\n  }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)\n

NOTE: The code above schedules a one-shot task on messageScheduler that posts a ResubmitFailedStages event to the event process loop after DAGScheduler.RESUBMIT_TIMEOUT milliseconds, so that fetch failures arriving close together can be handled by a single resubmission of the failed stages.

                                                                                                                                                                                                For all the cases, the failed stage and map stages are both added to the internal scheduler:DAGScheduler.md#failedStages[registry of failed stages].

                                                                                                                                                                                                If mapId (in the FetchFailed object for the case) is provided, the map stage output is cleaned up (as it is broken) using mapStage.removeOutputLoc(mapId, bmAddress) and scheduler:MapOutputTracker.md#unregisterMapOutput[MapOutputTrackerMaster.unregisterMapOutput(shuffleId, mapId, bmAddress)] methods.

                                                                                                                                                                                                CAUTION: FIXME What does mapStage.removeOutputLoc do?

                                                                                                                                                                                                If BlockManagerId (as bmAddress in the FetchFailed object) is defined, handleTaskCompletion notifies DAGScheduler that an executor was lost (with filesLost enabled and maybeEpoch from the scheduler:Task.md#epoch[Task] that completed).
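
A rough sketch of the abort-or-resubmit decision described above; the types, the counter registry and the 200 ms delay are assumptions made for illustration (Spark keeps the threshold in scheduler:Stage.md#failedOnFetchAndShouldAbort[Stage] and the delay in DAGScheduler.RESUBMIT_TIMEOUT):

import java.util.concurrent.{Executors, TimeUnit}\nimport scala.collection.mutable\n\nobject FetchFailureSketch {\n  val maxConsecutiveFetchFailures = 4 // matches the \"maximum allowable number of times: 4\" message above\n  val fetchFailedAttempts = mutable.Map.empty[Int, Int].withDefaultValue(0) // stageId -> fetch failures\n  val failedStageIds = mutable.Set.empty[Int]\n  val messageScheduler = Executors.newScheduledThreadPool(1)\n\n  def onFetchFailed(\n      failedStageId: Int,\n      mapStageId: Int,\n      resubmitFailedStages: () => Unit,\n      abortStage: Int => Unit): Unit = {\n    fetchFailedAttempts(failedStageId) += 1\n    if (fetchFailedAttempts(failedStageId) > maxConsecutiveFetchFailures) {\n      abortStage(failedStageId)                  // too many fetch failures: give up on the stage\n    } else {\n      if (failedStageIds.isEmpty) {              // first failure since the last resubmission round:\n        messageScheduler.schedule(               // schedule a single delayed resubmission\n          new Runnable { override def run(): Unit = resubmitFailedStages() },\n          200, TimeUnit.MILLISECONDS)            // assumed delay; Spark uses DAGScheduler.RESUBMIT_TIMEOUT\n      }\n      failedStageIds += failedStageId            // remember both the failed stage and its map stage\n      failedStageIds += mapStageId\n    }\n  }\n}\n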

                                                                                                                                                                                                handleTaskCompletion is used when:

                                                                                                                                                                                                • DAGSchedulerEventProcessLoop is requested to handle a CompletionEvent event.
                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleExecutorAdded","title":"ExecutorAdded Event Handler","text":"
                                                                                                                                                                                                handleExecutorAdded(\n  execId: String,\n  host: String): Unit\n

                                                                                                                                                                                                handleExecutorAdded...FIXME

                                                                                                                                                                                                handleExecutorAdded is used when DAGSchedulerEventProcessLoop is requested to handle an ExecutorAdded event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleExecutorLost","title":"ExecutorLost Event Handler","text":"
                                                                                                                                                                                                handleExecutorLost(\n  execId: String,\n  workerLost: Boolean): Unit\n

                                                                                                                                                                                                handleExecutorLost checks whether the input optional maybeEpoch is defined and if not requests the scheduler:MapOutputTracker.md#getEpoch[current epoch from MapOutputTrackerMaster].

                                                                                                                                                                                                NOTE: MapOutputTrackerMaster is passed in (as mapOutputTracker) when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].

                                                                                                                                                                                                CAUTION: FIXME When is maybeEpoch passed in?

Figure: DAGScheduler.handleExecutorLost

                                                                                                                                                                                                Recurring ExecutorLost events lead to the following repeating DEBUG message in the logs:

                                                                                                                                                                                                DEBUG Additional executor lost message for [execId] (epoch [currentEpoch])\n

                                                                                                                                                                                                NOTE: handleExecutorLost handler uses DAGScheduler's failedEpoch and FIXME internal registries.

                                                                                                                                                                                                Otherwise, when the executor execId is not in the scheduler:DAGScheduler.md#failedEpoch[list of executor lost] or the executor failure's epoch is smaller than the input maybeEpoch, the executor's lost event is recorded in scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry].

                                                                                                                                                                                                CAUTION: FIXME Describe the case above in simpler non-technical words. Perhaps change the order, too.

                                                                                                                                                                                                You should see the following INFO message in the logs:

                                                                                                                                                                                                INFO Executor lost: [execId] (epoch [epoch])\n

                                                                                                                                                                                                storage:BlockManagerMaster.md#removeExecutor[BlockManagerMaster is requested to remove the lost executor execId].

                                                                                                                                                                                                CAUTION: FIXME Review what's filesLost.

                                                                                                                                                                                                handleExecutorLost exits unless the ExecutorLost event was for a map output fetch operation (and the input filesLost is true) or external shuffle service is not used.

                                                                                                                                                                                                In such a case, you should see the following INFO message in the logs:

                                                                                                                                                                                                Shuffle files lost for executor: [execId] (epoch [epoch])\n

handleExecutorLost walks over all scheduler:ShuffleMapStage.md[ShuffleMapStage]s in scheduler:DAGScheduler.md#shuffleToMapStage[DAGScheduler's shuffleToMapStage internal registry] and does the following (in order):

                                                                                                                                                                                                1. ShuffleMapStage.removeOutputsOnExecutor(execId) is called
                                                                                                                                                                                                2. scheduler:MapOutputTrackerMaster.md#registerMapOutputs[MapOutputTrackerMaster.registerMapOutputs(shuffleId, stage.outputLocInMapOutputTrackerFormat(), changeEpoch = true)] is called.

                                                                                                                                                                                                In case scheduler:DAGScheduler.md#shuffleToMapStage[DAGScheduler's shuffleToMapStage internal registry] has no shuffles registered, scheduler:MapOutputTrackerMaster.md#incrementEpoch[MapOutputTrackerMaster is requested to increment epoch].

Ultimately, DAGScheduler scheduler:DAGScheduler.md#clearCacheLocs[clears the internal cache of RDD partition locations].
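
A simplified sketch of the shuffle-output cleanup loop, with hypothetical MiniShuffleMapStage and MiniMapOutputTracker types standing in for the real registry and MapOutputTrackerMaster:

import scala.collection.mutable\n\nobject ExecutorLostSketch {\n  class MiniShuffleMapStage(val shuffleId: Int) {\n    val outputsByExec = mutable.Map.empty[String, Set[Int]] // execId -> partitions whose outputs live there\n    def removeOutputsOnExecutor(execId: String): Unit = outputsByExec -= execId\n  }\n\n  trait MiniMapOutputTracker {\n    def registerMapOutputs(shuffleId: Int, stage: MiniShuffleMapStage, changeEpoch: Boolean): Unit\n    def incrementEpoch(): Unit\n  }\n\n  def handleShuffleFilesLost(\n      execId: String,\n      shuffleIdToMapStage: mutable.Map[Int, MiniShuffleMapStage],\n      tracker: MiniMapOutputTracker,\n      clearCacheLocs: () => Unit): Unit = {\n    if (shuffleIdToMapStage.isEmpty) {\n      tracker.incrementEpoch()                  // no shuffles registered: just bump the epoch\n    } else {\n      shuffleIdToMapStage.values.foreach { stage =>\n        stage.removeOutputsOnExecutor(execId)   // drop the lost executor's map outputs\n        tracker.registerMapOutputs(stage.shuffleId, stage, changeEpoch = true) // re-publish what is left\n      }\n    }\n    clearCacheLocs()                            // forget cached RDD partition locations\n  }\n}\n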

                                                                                                                                                                                                handleExecutorLost is used when DAGSchedulerEventProcessLoop is requested to handle an ExecutorLost event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleGetTaskResult","title":"GettingResultEvent Event Handler","text":"
                                                                                                                                                                                                handleGetTaskResult(\n  taskInfo: TaskInfo): Unit\n

                                                                                                                                                                                                handleGetTaskResult...FIXME

                                                                                                                                                                                                handleGetTaskResult is used when DAGSchedulerEventProcessLoop is requested to handle a GettingResultEvent event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleJobCancellation","title":"JobCancelled Event Handler","text":"
                                                                                                                                                                                                handleJobCancellation(\n  jobId: Int,\n  reason: Option[String]): Unit\n

                                                                                                                                                                                                handleJobCancellation looks up the active job for the input job ID (in jobIdToActiveJob internal registry) and fails it and all associated independent stages with failure reason:

                                                                                                                                                                                                Job [jobId] cancelled [reason]\n

                                                                                                                                                                                                When the input job ID is not found, handleJobCancellation prints out the following DEBUG message to the logs:

                                                                                                                                                                                                Trying to cancel unregistered job [jobId]\n

handleJobCancellation is used when DAGScheduler is requested to handle a JobCancelled event, doCancelAllJobs, handleJobGroupCancelled, and handleStageCancellation.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleJobGroupCancelled","title":"JobGroupCancelled Event Handler","text":"
                                                                                                                                                                                                handleJobGroupCancelled(\n  groupId: String): Unit\n

                                                                                                                                                                                                handleJobGroupCancelled finds active jobs in a group and cancels them.

                                                                                                                                                                                                Internally, handleJobGroupCancelled computes all the active jobs (registered in the internal collection of active jobs) that have spark.jobGroup.id scheduling property set to groupId.

handleJobGroupCancelled then cancels every active job in the group, one by one, with the cancellation reason:

                                                                                                                                                                                                part of cancelled job group [groupId]\n
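
In sketch form (hypothetical MiniActiveJob type and cancelJob callback), the group cancellation is a filter over the active jobs followed by a per-job cancellation with the reason above:

import java.util.Properties\n\nobject JobGroupCancellationSketch {\n  case class MiniActiveJob(jobId: Int, properties: Properties)\n\n  def handleJobGroupCancelled(\n      groupId: String,\n      activeJobs: Set[MiniActiveJob],\n      cancelJob: (Int, Option[String]) => Unit): Unit = {\n    val jobsInGroup = activeJobs.filter { job =>   // jobs whose spark.jobGroup.id matches groupId\n      Option(job.properties).exists(_.getProperty(\"spark.jobGroup.id\") == groupId)\n    }\n    jobsInGroup.foreach { job =>                   // cancel them one by one with the reason above\n      cancelJob(job.jobId, Some(s\"part of cancelled job group $groupId\"))\n    }\n  }\n}\n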

                                                                                                                                                                                                handleJobGroupCancelled is used when DAGScheduler is requested to handle JobGroupCancelled event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleJobSubmitted","title":"Handling JobSubmitted Event","text":"
                                                                                                                                                                                                handleJobSubmitted(\n  jobId: Int,\n  finalRDD: RDD[_],\n  func: (TaskContext, Iterator[_]) => _,\n  partitions: Array[Int],\n  callSite: CallSite,\n  listener: JobListener,\n  properties: Properties): Unit\n

                                                                                                                                                                                                handleJobSubmitted creates a ResultStage (finalStage) for the given RDD, func, partitions, jobId and callSite.

                                                                                                                                                                                                BarrierJobSlotsNumberCheckFailed Exception

                                                                                                                                                                                                Creating a ResultStage may fail with a BarrierJobSlotsNumberCheckFailed exception.

                                                                                                                                                                                                handleJobSubmitted removes the given jobId from the barrierJobIdToNumTasksCheckFailures.

                                                                                                                                                                                                handleJobSubmitted creates an ActiveJob for the ResultStage (with the given jobId, the callSite, the JobListener and the properties).

                                                                                                                                                                                                handleJobSubmitted clears the internal cache of RDD partition locations.

                                                                                                                                                                                                FIXME Why is this clearing here so important?

                                                                                                                                                                                                handleJobSubmitted prints out the following INFO messages to the logs (with missingParentStages):

                                                                                                                                                                                                Got job [id] ([callSite]) with [number] output partitions\nFinal stage: [finalStage] ([name])\nParents of final stage: [parents]\nMissing parents: [missingParentStages]\n

                                                                                                                                                                                                handleJobSubmitted registers the new ActiveJob in jobIdToActiveJob and activeJobs internal registries.

                                                                                                                                                                                                handleJobSubmitted requests the ResultStage to associate itself with the ActiveJob.

handleJobSubmitted uses the jobIdToStageIds internal registry to find all registered stages for the given jobId and the stageIdToStage internal registry to request their latest StageInfos (latestInfo).

                                                                                                                                                                                                In the end, handleJobSubmitted posts a SparkListenerJobStart message to the LiveListenerBus and submits the ResultStage.
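
A high-level sketch of the happy path, heavily simplified: the Mini* types and callbacks are hypothetical stand-ins, and accumulators, missing-parent computation and the barrier check are left out:

import scala.collection.mutable\n\nobject JobSubmittedSketch {\n  class MiniResultStage(val id: Int) { var activeJob: Option[MiniActiveJob] = None }\n  class MiniActiveJob(val jobId: Int, val finalStage: MiniResultStage)\n\n  class MiniDAGScheduler(\n      createResultStage: Int => MiniResultStage,  // may throw BarrierJobSlotsNumberCheckFailed in Spark\n      postJobStart: MiniActiveJob => Unit,        // SparkListenerJobStart on the LiveListenerBus\n      submitStage: MiniResultStage => Unit) {\n\n    val jobIdToActiveJob = mutable.Map.empty[Int, MiniActiveJob]\n    val activeJobs = mutable.Set.empty[MiniActiveJob]\n\n    def handleJobSubmitted(jobId: Int): Unit = {\n      val finalStage = createResultStage(jobId)      // 1. ResultStage for the final RDD\n      val job = new MiniActiveJob(jobId, finalStage) // 2. ActiveJob for the ResultStage\n      // 3. clearCacheLocs() happens here in the real scheduler\n      jobIdToActiveJob(jobId) = job                  // 4. register the new job\n      activeJobs += job\n      finalStage.activeJob = Some(job)               // 5. associate the ResultStage with the ActiveJob\n      postJobStart(job)                              // 6. announce the job start\n      submitStage(finalStage)                        // 7. submit the ResultStage\n    }\n  }\n}\n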

                                                                                                                                                                                                handleJobSubmitted is used when:

                                                                                                                                                                                                • DAGSchedulerEventProcessLoop is requested to handle a JobSubmitted event
                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleJobSubmitted-BarrierJobSlotsNumberCheckFailed","title":"BarrierJobSlotsNumberCheckFailed","text":"

                                                                                                                                                                                                In case of a BarrierJobSlotsNumberCheckFailed exception while creating a ResultStage, handleJobSubmitted increments the number of failures in the barrierJobIdToNumTasksCheckFailures for the given jobId.

                                                                                                                                                                                                handleJobSubmitted prints out the following WARN message to the logs (with spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures):

                                                                                                                                                                                                Barrier stage in job [jobId] requires [requiredConcurrentTasks] slots, but only [maxConcurrentTasks] are available. Will retry up to [maxFailures] more times\n

                                                                                                                                                                                                If the number of failures is below the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted requests the messageScheduler to schedule a one-shot task that requests the DAGSchedulerEventProcessLoop to post a JobSubmitted event (after spark.scheduler.barrier.maxConcurrentTasksCheck.interval seconds).

                                                                                                                                                                                                Note

                                                                                                                                                                                                Posting a JobSubmitted event is to request the DAGScheduler to re-consider the request, hoping that there will be enough resources to fulfill the resource requirements of a barrier job.

Otherwise, if the number of failures crossed the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted removes the jobId from the barrierJobIdToNumTasksCheckFailures and informs the given JobListener that the job failed (jobFailed).
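
A sketch of the retry bookkeeping; the registry, the concrete maxFailures/interval values and the use of a plain ScheduledExecutorService are assumptions standing in for Spark's configuration properties and messageScheduler:

import java.util.concurrent.{Executors, TimeUnit}\nimport scala.collection.mutable\n\nobject BarrierRetrySketch {\n  val maxFailures = 40         // stands in for spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures\n  val retryIntervalSecs = 15L  // stands in for spark.scheduler.barrier.maxConcurrentTasksCheck.interval\n  val barrierJobIdToNumTasksCheckFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)\n  val messageScheduler = Executors.newSingleThreadScheduledExecutor()\n\n  def onBarrierCheckFailed(jobId: Int, resubmitJob: () => Unit, failJob: Exception => Unit): Unit = {\n    barrierJobIdToNumTasksCheckFailures(jobId) += 1\n    if (barrierJobIdToNumTasksCheckFailures(jobId) <= maxFailures) {\n      // retry: post the JobSubmitted event again after the configured interval\n      messageScheduler.schedule(\n        new Runnable { override def run(): Unit = resubmitJob() },\n        retryIntervalSecs, TimeUnit.SECONDS)\n    } else {\n      barrierJobIdToNumTasksCheckFailures -= jobId      // give up on this job\n      failJob(new Exception(\"Barrier stage requires more slots than are available\"))\n    }\n  }\n}\n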

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleMapStageSubmitted","title":"MapStageSubmitted","text":"
                                                                                                                                                                                                handleMapStageSubmitted(\n  jobId: Int,\n  dependency: ShuffleDependency[_, _, _],\n  callSite: CallSite,\n  listener: JobListener,\n  properties: Properties): Unit\n

                                                                                                                                                                                                Note

                                                                                                                                                                                                MapStageSubmitted event processing is very similar to JobSubmitted event's.

                                                                                                                                                                                                handleMapStageSubmitted finds or creates a new ShuffleMapStage for the given ShuffleDependency and jobId.

                                                                                                                                                                                                handleMapStageSubmitted creates an ActiveJob (with the given jobId, the ShuffleMapStage, the given JobListener).

                                                                                                                                                                                                handleMapStageSubmitted clears the internal cache of RDD partition locations.

                                                                                                                                                                                                handleMapStageSubmitted prints out the following INFO messages to the logs:

Got map stage job [id] ([callSite]) with [number] output partitions
Final stage: [stage] ([name])
Parents of final stage: [parents]
Missing parents: [missingParentStages]

handleMapStageSubmitted registers the new ActiveJob in the jobIdToActiveJob and activeJobs internal registries, and with the ShuffleMapStage.

                                                                                                                                                                                                Note

                                                                                                                                                                                                ShuffleMapStage can have multiple ActiveJobs registered.

                                                                                                                                                                                                handleMapStageSubmitted finds all the registered stages for the input jobId and collects their latest StageInfo.

                                                                                                                                                                                                In the end, handleMapStageSubmitted posts a SparkListenerJobStart event to the LiveListenerBus and submits the ShuffleMapStage.

When the ShuffleMapStage is already available, handleMapStageSubmitted marks the job as finished.

When a ShuffleMapStage could not be found or created, handleMapStageSubmitted prints out the following WARN message to the logs:

Creating new stage failed due to exception - job: [id]

                                                                                                                                                                                                handleMapStageSubmitted notifies the JobListener about the job failure and exits.

                                                                                                                                                                                                handleMapStageSubmitted is used when:

                                                                                                                                                                                                • DAGSchedulerEventProcessLoop is requested to handle a MapStageSubmitted event
                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#resubmitFailedStages","title":"ResubmitFailedStages Event Handler","text":"
resubmitFailedStages(): Unit

                                                                                                                                                                                                resubmitFailedStages iterates over the internal collection of failed stages and submits them.

                                                                                                                                                                                                Note

                                                                                                                                                                                                resubmitFailedStages does nothing when there are no failed stages reported.

                                                                                                                                                                                                resubmitFailedStages prints out the following INFO message to the logs:

Resubmitting failed stages

                                                                                                                                                                                                resubmitFailedStages clears the internal cache of RDD partition locations and makes a copy of the collection of failed stages to track failed stages afresh.

                                                                                                                                                                                                Note

                                                                                                                                                                                                At this point DAGScheduler has no failed stages reported.

The previously-reported failed stages are sorted by their job IDs in increasing order and resubmitted.
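The copy-clear-sort-resubmit sequence can be sketched as follows; Stage, firstJobId and submitStage are simplified stand-ins for the internals, not the actual definitions:

import scala.collection.mutable

object ResubmitFailedStagesSketch {
  // simplified stand-in for a scheduler stage; only the fields needed here
  final case class Stage(id: Int, firstJobId: Int)

  private val failedStages = mutable.HashSet.empty[Stage]

  // hypothetical submit hook; the real scheduler would submit the stage for execution
  private def submitStage(stage: Stage): Unit =
    println(s"Resubmitting stage ${stage.id} (job ${stage.firstJobId})")

  def resubmitFailedStages(): Unit = {
    if (failedStages.nonEmpty) {
      println("Resubmitting failed stages")
      // copy the failed stages and clear the registry so new failures are tracked afresh
      val toResubmit = failedStages.toSeq
      failedStages.clear()
      // resubmit in increasing order of the owning job ID
      toResubmit.sortBy(_.firstJobId).foreach(submitStage)
    }
  }
}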

                                                                                                                                                                                                resubmitFailedStages is used when DAGSchedulerEventProcessLoop is requested to handle a ResubmitFailedStages event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleSpeculativeTaskSubmitted","title":"SpeculativeTaskSubmitted Event Handler","text":"
handleSpeculativeTaskSubmitted(): Unit

                                                                                                                                                                                                handleSpeculativeTaskSubmitted...FIXME

                                                                                                                                                                                                handleSpeculativeTaskSubmitted is used when DAGSchedulerEventProcessLoop is requested to handle a SpeculativeTaskSubmitted event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleStageCancellation","title":"StageCancelled Event Handler","text":"
handleStageCancellation(): Unit

                                                                                                                                                                                                handleStageCancellation...FIXME

                                                                                                                                                                                                handleStageCancellation is used when DAGSchedulerEventProcessLoop is requested to handle a StageCancelled event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleTaskSetFailed","title":"TaskSetFailed Event Handler","text":"
handleTaskSetFailed(): Unit

                                                                                                                                                                                                handleTaskSetFailed...FIXME

                                                                                                                                                                                                handleTaskSetFailed is used when DAGSchedulerEventProcessLoop is requested to handle a TaskSetFailed event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#handleWorkerRemoved","title":"WorkerRemoved Event Handler","text":"
handleWorkerRemoved(
  workerId: String,
  host: String,
  message: String): Unit

                                                                                                                                                                                                handleWorkerRemoved...FIXME

                                                                                                                                                                                                handleWorkerRemoved is used when DAGSchedulerEventProcessLoop is requested to handle a WorkerRemoved event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#internal-properties","title":"Internal Properties","text":""},{"location":"scheduler/DAGScheduler/#failedEpoch","title":"failedEpoch","text":"

                                                                                                                                                                                                The lookup table of lost executors and the epoch of the event.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#failedStages","title":"failedStages","text":"

Stages that failed due to fetch failures (when a task fails with a FetchFailed exception).

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#jobIdToActiveJob","title":"jobIdToActiveJob","text":"

                                                                                                                                                                                                The lookup table of ActiveJobs per job id.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#jobIdToStageIds","title":"jobIdToStageIds","text":"

                                                                                                                                                                                                The lookup table of all stages per ActiveJob id

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#nextJobId","title":"nextJobId Counter","text":"
nextJobId: AtomicInteger

                                                                                                                                                                                                nextJobId is a Java AtomicInteger for job IDs.

                                                                                                                                                                                                nextJobId starts at 0.

                                                                                                                                                                                                Used when DAGScheduler is requested for numTotalJobs, to submitJob, runApproximateJob and submitMapStage.
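A minimal illustration of the pattern (the method names below are illustrative, not the DAGScheduler's):

import java.util.concurrent.atomic.AtomicInteger

object JobIdCounterSketch {
  private val nextJobId = new AtomicInteger(0)

  // returns the current value and advances the counter, so concurrent callers get unique job IDs
  def newJobId(): Int = nextJobId.getAndIncrement()

  // the number of job IDs handed out so far
  def numTotalJobs: Int = nextJobId.get()
}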

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#nextStageId","title":"nextStageId","text":"

                                                                                                                                                                                                The next stage id counting from 0.

                                                                                                                                                                                                Used when DAGScheduler creates a shuffle map stage and a result stage. It is the key in stageIdToStage.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#runningStages","title":"runningStages","text":"

                                                                                                                                                                                                The set of stages that are currently \"running\".

A stage is added when submitMissingTasks gets executed (without first checking whether the stage has already been added).

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#shuffleIdToMapStage","title":"shuffleIdToMapStage","text":"

                                                                                                                                                                                                A lookup table of ShuffleMapStages by ShuffleDependency

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#stageIdToStage","title":"stageIdToStage","text":"

                                                                                                                                                                                                A lookup table of stages by stage ID

Used when DAGScheduler creates a shuffle map stage or a result stage, cleans up job state and independent stages, is informed that a task has started, a task set has failed, a job has been submitted (to compute a ResultStage), a map stage has been submitted, a task has completed, or a stage has been cancelled, updates accumulators, aborts a stage, and fails a job and independent stages.

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#waitingStages","title":"waitingStages","text":"

                                                                                                                                                                                                Stages with parents to be computed

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#event-posting-methods","title":"Event Posting Methods","text":""},{"location":"scheduler/DAGScheduler/#cancelAllJobs","title":"Posting AllJobsCancelled","text":"

                                                                                                                                                                                                Posts an AllJobsCancelled

                                                                                                                                                                                                Used when SparkContext is requested to cancel all running or scheduled Spark jobs

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#cancelJob","title":"Posting JobCancelled","text":"

                                                                                                                                                                                                Posts a JobCancelled

                                                                                                                                                                                                Used when SparkContext or JobWaiter are requested to cancel a Spark job

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#cancelJobGroup","title":"Posting JobGroupCancelled","text":"

                                                                                                                                                                                                Posts a JobGroupCancelled

                                                                                                                                                                                                Used when SparkContext is requested to cancel a job group

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#cancelStage","title":"Posting StageCancelled","text":"

                                                                                                                                                                                                Posts a StageCancelled

                                                                                                                                                                                                Used when SparkContext is requested to cancel a stage

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#executorAdded","title":"Posting ExecutorAdded","text":"

                                                                                                                                                                                                Posts an ExecutorAdded

                                                                                                                                                                                                Used when TaskSchedulerImpl is requested to handle resource offers (and a new executor is found in the resource offers)

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#executorLost","title":"Posting ExecutorLost","text":"

Posts an ExecutorLost

Used when TaskSchedulerImpl is requested to handle a task status update (and a task is lost, which indicates that the executor is broken and should hence be considered lost) or to executorLost

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#runApproximateJob","title":"Posting JobSubmitted","text":"

                                                                                                                                                                                                Posts a JobSubmitted

                                                                                                                                                                                                Used when SparkContext is requested to run an approximate job

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#speculativeTaskSubmitted","title":"Posting SpeculativeTaskSubmitted","text":"

                                                                                                                                                                                                Posts a SpeculativeTaskSubmitted

                                                                                                                                                                                                Used when TaskSetManager is requested to checkAndSubmitSpeculatableTask

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#taskEnded","title":"Posting CompletionEvent","text":"

                                                                                                                                                                                                Posts a CompletionEvent

                                                                                                                                                                                                Used when TaskSetManager is requested to handleSuccessfulTask, handleFailedTask, and executorLost

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#taskGettingResult","title":"Posting GettingResultEvent","text":"

                                                                                                                                                                                                Posts a GettingResultEvent

                                                                                                                                                                                                Used when TaskSetManager is requested to handle a task fetching result

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#taskSetFailed","title":"Posting TaskSetFailed","text":"

                                                                                                                                                                                                Posts a TaskSetFailed

                                                                                                                                                                                                Used when TaskSetManager is requested to abort

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#taskStarted","title":"Posting BeginEvent","text":"

                                                                                                                                                                                                Posts a BeginEvent

                                                                                                                                                                                                Used when TaskSetManager is requested to start a task

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#workerRemoved","title":"Posting WorkerRemoved","text":"

                                                                                                                                                                                                Posts a WorkerRemoved

                                                                                                                                                                                                Used when TaskSchedulerImpl is requested to handle a removed worker event

                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#updateAccumulators","title":"Updating Accumulators of Completed Tasks
updateAccumulators(
  event: CompletionEvent): Unit

updateAccumulators merges the partial values of accumulators from a completed task (based on the given CompletionEvent) into their "source" accumulators on the driver.

                                                                                                                                                                                                For every AccumulatorV2 update (in the given CompletionEvent), updateAccumulators finds the corresponding accumulator on the driver and requests the AccumulatorV2 to merge the updates.

                                                                                                                                                                                                updateAccumulators...FIXME

For named accumulators whose update value is non-zero (i.e. not Accumulable.zero):

                                                                                                                                                                                                • stage.latestInfo.accumulables for the AccumulableInfo.id is set
                                                                                                                                                                                                • CompletionEvent.taskInfo.accumulables has a new AccumulableInfo added.

                                                                                                                                                                                                CAUTION: FIXME Where are Stage.latestInfo.accumulables and CompletionEvent.taskInfo.accumulables used?

                                                                                                                                                                                                updateAccumulators is used when DAGScheduler is requested to handle a task completion.
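The merge step itself relies on the public AccumulatorV2 API. The following standalone sketch merges a task-side partial value into a driver-side LongAccumulator; the two accumulators are created directly for illustration (in a real application they would be registered through a SparkContext):

import org.apache.spark.util.LongAccumulator

object AccumulatorMergeSketch {
  def main(args: Array[String]): Unit = {
    // "driver-side" accumulator holding the merged total
    val driverAcc = new LongAccumulator
    driverAcc.add(10L)

    // "task-side" accumulator carrying a partial update from a completed task
    val taskUpdate = new LongAccumulator
    taskUpdate.add(5L)

    // what updateAccumulators conceptually does for each matching pair of accumulators
    driverAcc.merge(taskUpdate)

    println(driverAcc.value) // 15
  }
}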

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#postTaskEnd","title":"Posting SparkListenerTaskEnd (at Task Completion)
postTaskEnd(
  event: CompletionEvent): Unit

                                                                                                                                                                                                postTaskEnd reconstructs task metrics (from the accumulator updates in the CompletionEvent).

                                                                                                                                                                                                In the end, postTaskEnd creates a SparkListenerTaskEnd and requests the LiveListenerBus to post it.

                                                                                                                                                                                                postTaskEnd is used when:

                                                                                                                                                                                                • DAGScheduler is requested to handle a task completion
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#checkBarrierStageWithNumSlots","title":"checkBarrierStageWithNumSlots
checkBarrierStageWithNumSlots(
  rdd: RDD[_],
  rp: ResourceProfile): Unit
                                                                                                                                                                                                Noop for Non-Barrier RDDs

Unless the given RDD is a barrier RDD (isBarrier), checkBarrierStageWithNumSlots does nothing (it is a noop).

                                                                                                                                                                                                checkBarrierStageWithNumSlots requests the given RDD for the number of partitions.

                                                                                                                                                                                                checkBarrierStageWithNumSlots requests the SparkContext for the maximum number of concurrent tasks for the given ResourceProfile.

                                                                                                                                                                                                If the number of partitions (based on the RDD) is greater than the maximum number of concurrent tasks (based on the ResourceProfile), checkBarrierStageWithNumSlots reports a BarrierJobSlotsNumberCheckFailed exception.
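A minimal sketch of the check itself; the boolean flag, the two counts and the exception type are simplified stand-ins for the real RDD, SparkContext and BarrierJobSlotsNumberCheckFailed pieces:

object BarrierSlotsCheckSketch {
  // simplified stand-in for the BarrierJobSlotsNumberCheckFailed exception
  final case class BarrierSlotsCheckFailed(required: Int, available: Int)
      extends Exception(s"Barrier stage requires $required slots, but only $available are available")

  def checkBarrierStageWithNumSlots(
      isBarrier: Boolean,        // whether the RDD is a barrier RDD
      numPartitions: Int,        // the RDD's number of partitions
      maxConcurrentTasks: Int    // the maximum number of concurrent tasks for the ResourceProfile
  ): Unit = {
    if (isBarrier && numPartitions > maxConcurrentTasks) {
      throw BarrierSlotsCheckFailed(numPartitions, maxConcurrentTasks)
    }
    // otherwise a noop: non-barrier RDD, or enough slots available
  }
}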

                                                                                                                                                                                                checkBarrierStageWithNumSlots is used when:

• DAGScheduler is requested to create a ShuffleMapStage or a ResultStage
                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#utilities","title":"Utilities

                                                                                                                                                                                                Danger

                                                                                                                                                                                                The section includes (hides) utility methods that do not really contribute to the understanding of how DAGScheduler works internally.

                                                                                                                                                                                                It's very likely they should not even be part of this page.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGScheduler/#getShuffleDependenciesAndResourceProfiles","title":"Finding Shuffle Dependencies and ResourceProfiles of RDD","text":"
getShuffleDependenciesAndResourceProfiles(
  rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile])

                                                                                                                                                                                                getShuffleDependenciesAndResourceProfiles returns the direct ShuffleDependencies and all the ResourceProfiles of the given RDD and parent non-shuffle RDDs, if available.

                                                                                                                                                                                                getShuffleDependenciesAndResourceProfiles collects ResourceProfiles of the given RDD and any parent RDDs, if available.

getShuffleDependenciesAndResourceProfiles collects the direct ShuffleDependencies of the given RDD and of any parent RDDs reachable through non-shuffle dependencies, if available.
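Conceptually, the traversal is a depth-first walk over the dependency graph that stops at shuffle boundaries. A simplified, self-contained sketch of that walk (with toy Node and Dep types instead of the real RDD and Dependency classes, and with ResourceProfiles left out):

import scala.collection.mutable

object ShuffleDepsSketch {
  // toy dependency graph: a node either narrowly depends on a parent or shuffles from one
  sealed trait Dep
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node, shuffleId: Int) extends Dep
  final case class Node(id: Int, deps: Seq[Dep])

  // collect the direct shuffle dependencies reachable without crossing another shuffle
  def directShuffleDeps(root: Node): Set[Shuffle] = {
    val shuffles = mutable.HashSet.empty[Shuffle]
    val visited = mutable.HashSet.empty[Int]
    val waiting = mutable.Stack(root)
    while (waiting.nonEmpty) {
      val node = waiting.pop()
      if (visited.add(node.id)) {
        node.deps.foreach {
          case s: Shuffle => shuffles += s   // stop here: do not walk past a shuffle boundary
          case Narrow(p)  => waiting.push(p) // keep walking through narrow dependencies
        }
      }
    }
    shuffles.toSet
  }
}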

                                                                                                                                                                                                getShuffleDependenciesAndResourceProfiles is used when:

• DAGScheduler is requested to create a ShuffleMapStage or a ResultStage, and for the missing ShuffleDependencies of an RDD
                                                                                                                                                                                                "},{"location":"scheduler/DAGScheduler/#logging","title":"Logging

                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.scheduler.DAGScheduler logger to see what happens inside.

                                                                                                                                                                                                Add the following line to conf/log4j2.properties:

logger.DAGScheduler.name = org.apache.spark.scheduler.DAGScheduler
logger.DAGScheduler.level = all

                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGSchedulerEvent/","title":"DAGSchedulerEvent","text":"

                                                                                                                                                                                                DAGSchedulerEvent is an abstraction of events that are handled by the DAGScheduler (on dag-scheduler-event-loop daemon thread).
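In Scala terms, such an event family is typically modelled as a sealed hierarchy of case classes dispatched by a single handler on the event-loop thread. The following is a generic sketch of that pattern with simplified payloads (not the actual DAGSchedulerEvent definitions):

object EventLoopSketch {
  sealed trait SchedulerEvent
  case object AllJobsCancelled extends SchedulerEvent
  final case class JobCancelled(jobId: Int, reason: Option[String]) extends SchedulerEvent
  final case class WorkerRemoved(workerId: String, host: String, message: String) extends SchedulerEvent

  // the event loop dispatches every event to its handler on a single daemon thread
  def onReceive(event: SchedulerEvent): Unit = event match {
    case AllJobsCancelled             => println("cancel all jobs")
    case JobCancelled(jobId, reason)  => println(s"cancel job $jobId: ${reason.getOrElse("")}")
    case WorkerRemoved(id, host, msg) => println(s"worker $id on $host removed: $msg")
  }
}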

                                                                                                                                                                                                "},{"location":"scheduler/DAGSchedulerEvent/#alljobscancelled","title":"AllJobsCancelled

                                                                                                                                                                                                Carries no extra information

                                                                                                                                                                                                Posted when DAGScheduler is requested to cancelAllJobs

                                                                                                                                                                                                Event handler: doCancelAllJobs

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGSchedulerEvent/#beginevent","title":"BeginEvent

                                                                                                                                                                                                Carries the following:

                                                                                                                                                                                                • Task
                                                                                                                                                                                                • TaskInfo

                                                                                                                                                                                                Posted when DAGScheduler is requested to taskStarted

                                                                                                                                                                                                Event handler: handleBeginEvent

                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGSchedulerEvent/#completionevent","title":"CompletionEvent

                                                                                                                                                                                                Carries the following:

                                                                                                                                                                                                • Completed Task
                                                                                                                                                                                                • TaskEndReason
                                                                                                                                                                                                • Result (value computed)
                                                                                                                                                                                                • AccumulatorV2 Updates
                                                                                                                                                                                                • Metric Peaks
                                                                                                                                                                                                • TaskInfo

                                                                                                                                                                                                  Posted when DAGScheduler is requested to taskEnded

                                                                                                                                                                                                  Event handler: handleTaskCompletion

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#executoradded","title":"ExecutorAdded

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                  • Host name

                                                                                                                                                                                                  Posted when DAGScheduler is requested to executorAdded

                                                                                                                                                                                                  Event handler: handleExecutorAdded

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#executorlost","title":"ExecutorLost

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                  • Reason

                                                                                                                                                                                                  Posted when DAGScheduler is requested to executorLost

                                                                                                                                                                                                  Event handler: handleExecutorLost

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#gettingresultevent","title":"GettingResultEvent

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • TaskInfo

                                                                                                                                                                                                  Posted when DAGScheduler is requested to taskGettingResult

                                                                                                                                                                                                  Event handler: handleGetTaskResult

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobcancelled","title":"JobCancelled

                                                                                                                                                                                                  JobCancelled event carries the following:

                                                                                                                                                                                                  • Job ID
                                                                                                                                                                                                  • Reason (optional)

                                                                                                                                                                                                  Posted when DAGScheduler is requested to cancelJob

                                                                                                                                                                                                  Event handler: handleJobCancellation

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobgroupcancelled","title":"JobGroupCancelled

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Group ID

                                                                                                                                                                                                  Posted when DAGScheduler is requested to cancelJobGroup

                                                                                                                                                                                                  Event handler: handleJobGroupCancelled

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobsubmitted","title":"JobSubmitted

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Job ID
                                                                                                                                                                                                  • RDD
• Partition processing function (with a TaskContext and the partition data, i.e. (TaskContext, Iterator[_]) => _); see the runJob example at the end of this entry
                                                                                                                                                                                                  • Partition IDs to compute
                                                                                                                                                                                                  • CallSite
                                                                                                                                                                                                  • JobListener to keep updated about the status of the stage execution
                                                                                                                                                                                                  • Execution properties

                                                                                                                                                                                                  Posted when:

                                                                                                                                                                                                  • DAGScheduler is requested to submit a job, run an approximate job and handleJobSubmitted

                                                                                                                                                                                                  Event handler: handleJobSubmitted
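For reference, a partition processing function of that shape is what user code passes to SparkContext.runJob. A minimal local example (the local[2] master, the toy dataset and the per-partition sum are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object RunJobExample {
  def main(args: Array[String]): Unit = {
    // local SparkContext just for illustration
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("runJob-demo"))
    try {
      val rdd = sc.parallelize(1 to 10, numSlices = 2)
      // the partition processing function: (TaskContext, Iterator[Int]) => Int
      val perPartitionSums: Array[Int] =
        sc.runJob(rdd, (ctx: TaskContext, it: Iterator[Int]) => {
          println(s"computing partition ${ctx.partitionId()}")
          it.sum
        })
      println(perPartitionSums.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}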

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#mapstagesubmitted","title":"MapStageSubmitted

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Job ID
                                                                                                                                                                                                  • ShuffleDependency
                                                                                                                                                                                                  • CallSite
                                                                                                                                                                                                  • JobListener
                                                                                                                                                                                                  • Execution properties

                                                                                                                                                                                                  Posted when:

                                                                                                                                                                                                  • DAGScheduler is requested to submit a MapStage for execution

                                                                                                                                                                                                  Event handler: handleMapStageSubmitted

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#resubmitfailedstages","title":"ResubmitFailedStages

                                                                                                                                                                                                  Carries no extra information.

                                                                                                                                                                                                  Posted when DAGScheduler is requested to handleTaskCompletion

                                                                                                                                                                                                  Event handler: resubmitFailedStages

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#shufflemergefinalized","title":"ShuffleMergeFinalized

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • ShuffleMapStage

                                                                                                                                                                                                  Posted when:

                                                                                                                                                                                                  • DAGScheduler is requested to finalizeShuffleMerge

                                                                                                                                                                                                  Event handler: handleShuffleMergeFinalized

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#speculativetasksubmitted","title":"SpeculativeTaskSubmitted

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Task

                                                                                                                                                                                                  Posted when DAGScheduler is requested to speculativeTaskSubmitted

                                                                                                                                                                                                  Event handler: handleSpeculativeTaskSubmitted

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#stagecancelled","title":"StageCancelled

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • Stage ID
                                                                                                                                                                                                  • Reason (optional)

                                                                                                                                                                                                  Posted when DAGScheduler is requested to cancelStage

                                                                                                                                                                                                  Event handler: handleStageCancellation

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#tasksetfailed","title":"TaskSetFailed

                                                                                                                                                                                                  Carries the following:

                                                                                                                                                                                                  • TaskSet
                                                                                                                                                                                                  • Reason
                                                                                                                                                                                                  • Exception (optional)

                                                                                                                                                                                                  Posted when DAGScheduler is requested to taskSetFailed

                                                                                                                                                                                                  Event handler: handleTaskSetFailed

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEvent/#workerremoved","title":"WorkerRemoved

                                                                                                                                                                                                  Carries the following:

• Worker ID
                                                                                                                                                                                                  • Host name
                                                                                                                                                                                                  • Reason

                                                                                                                                                                                                  Posted when DAGScheduler is requested to workerRemoved

                                                                                                                                                                                                  Event handler: handleWorkerRemoved

                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/","title":"DAGSchedulerEventProcessLoop","text":"

                                                                                                                                                                                                  DAGSchedulerEventProcessLoop is an event processing daemon thread to handle DAGSchedulerEvents (on a separate thread from the parent DAGScheduler's).

                                                                                                                                                                                                  DAGSchedulerEventProcessLoop is registered under the name of dag-scheduler-event-loop.

                                                                                                                                                                                                  DAGSchedulerEventProcessLoop uses java.util.concurrent.LinkedBlockingDeque blocking deque that can grow indefinitely.
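
For illustration, a minimal sketch of such an event loop follows: a daemon thread drains a LinkedBlockingDeque and dispatches by event type. It is a simplified stand-in (not Spark's EventLoop class), with the event hierarchy trimmed to two events.

import java.util.concurrent.LinkedBlockingDeque

sealed trait DAGSchedulerEvent                        // stand-in for the real event hierarchy
case object ResubmitFailedStages extends DAGSchedulerEvent
final case class StageCancelled(stageId: Int, reason: Option[String]) extends DAGSchedulerEvent

class EventLoopSketch(name: String) {
  private val eventQueue = new LinkedBlockingDeque[DAGSchedulerEvent]()  // can grow indefinitely

  private val eventThread = new Thread(name) {
    setDaemon(true)                                   // daemon thread, as in dag-scheduler-event-loop
    override def run(): Unit =
      while (!Thread.currentThread().isInterrupted) {
        onReceive(eventQueue.take())                  // block until an event is posted, then dispatch
      }
  }

  def start(): Unit = eventThread.start()
  def post(event: DAGSchedulerEvent): Unit = eventQueue.put(event)

  // Dispatch: each event type is routed to its handler (cf. the Processing Event table below).
  protected def onReceive(event: DAGSchedulerEvent): Unit = event match {
    case ResubmitFailedStages         => println("resubmitFailedStages()")
    case StageCancelled(stageId, why) => println(s"handleStageCancellation($stageId, $why)")
  }
}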

                                                                                                                                                                                                  "},{"location":"scheduler/DAGSchedulerEventProcessLoop/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                  DAGSchedulerEventProcessLoop takes the following to be created:

                                                                                                                                                                                                  • DAGScheduler

DAGSchedulerEventProcessLoop is created when:

                                                                                                                                                                                                    • DAGScheduler is created
                                                                                                                                                                                                    "},{"location":"scheduler/DAGSchedulerEventProcessLoop/#processing-event","title":"Processing Event DAGSchedulerEvent Event Handler AllJobsCancelled doCancelAllJobs BeginEvent handleBeginEvent CompletionEvent handleTaskCompletion ExecutorAdded handleExecutorAdded ExecutorLost handleExecutorLost GettingResultEvent handleGetTaskResult JobCancelled handleJobCancellation JobGroupCancelled handleJobGroupCancelled JobSubmitted handleJobSubmitted MapStageSubmitted handleMapStageSubmitted ResubmitFailedStages resubmitFailedStages SpeculativeTaskSubmitted handleSpeculativeTaskSubmitted StageCancelled handleStageCancellation TaskSetFailed handleTaskSetFailed WorkerRemoved handleWorkerRemoved","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/#shufflemergefinalized","title":"ShuffleMergeFinalized
                                                                                                                                                                                                    • Event: ShuffleMergeFinalized
                                                                                                                                                                                                    • Event handler: handleShuffleMergeFinalized
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/#messageprocessingtime-timer","title":"messageProcessingTime Timer

                                                                                                                                                                                                    DAGSchedulerEventProcessLoop uses messageProcessingTime timer to measure time of processing events.
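
As a rough sketch of what such a timer does, assuming the Dropwizard Metrics library (which Spark's metrics system builds on); the names below are illustrative only:

import com.codahale.metrics.{MetricRegistry, Timer}

val registry = new MetricRegistry()
val messageProcessingTime: Timer = registry.timer("messageProcessingTime")

def timedOnReceive(handle: () => Unit): Unit = {
  val timerContext = messageProcessingTime.time()  // start measuring
  try handle()                                     // process the event
  finally timerContext.stop()                      // record the elapsed time
}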

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DAGSchedulerSource/","title":"DAGSchedulerSource","text":"

                                                                                                                                                                                                    DAGSchedulerSource is the metrics source of DAGScheduler.

                                                                                                                                                                                                    The name of the source is DAGScheduler.

                                                                                                                                                                                                    DAGSchedulerSource emits the following metrics:

                                                                                                                                                                                                    • stage.failedStages - the number of failed stages
                                                                                                                                                                                                    • stage.runningStages - the number of running stages
                                                                                                                                                                                                    • stage.waitingStages - the number of waiting stages
                                                                                                                                                                                                    • job.allJobs - the number of all jobs
                                                                                                                                                                                                    • job.activeJobs - the number of active jobs
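
As a sketch of how such gauges can be exposed with the Dropwizard Metrics library; the counters below are hypothetical stand-ins for the DAGScheduler state the real source reads.

import com.codahale.metrics.{Gauge, MetricRegistry}

object DAGSchedulerSourceSketch {
  // Hypothetical counters standing in for the DAGScheduler state behind the real gauges.
  @volatile var failedStageCount: Int = 0
  @volatile var runningStageCount: Int = 0

  val metricRegistry = new MetricRegistry()

  // Each metric is a Gauge that reads the current value on demand.
  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new Gauge[Int] {
    override def getValue: Int = failedStageCount
  })
  metricRegistry.register(MetricRegistry.name("stage", "runningStages"), new Gauge[Int] {
    override def getValue: Int = runningStageCount
  })
}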
                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/","title":"DriverEndpoint","text":"

                                                                                                                                                                                                    DriverEndpoint is a ThreadSafeRpcEndpoint that is a message handler for CoarseGrainedSchedulerBackend to communicate with CoarseGrainedExecutorBackend.

                                                                                                                                                                                                    DriverEndpoint is registered under the name CoarseGrainedScheduler by CoarseGrainedSchedulerBackend.

DriverEndpoint uses the executorDataMap internal registry of all the executors that have registered with the driver. An executor sends a RegisterExecutor message to announce that it wants to register.

                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                    DriverEndpoint takes no arguments to be created.

                                                                                                                                                                                                    DriverEndpoint is created when:

                                                                                                                                                                                                    • CoarseGrainedSchedulerBackend is created (and registers a CoarseGrainedScheduler RPC endpoint)
                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/#executorlogurlhandler","title":"ExecutorLogUrlHandler
                                                                                                                                                                                                    logUrlHandler: ExecutorLogUrlHandler\n

                                                                                                                                                                                                    DriverEndpoint creates an ExecutorLogUrlHandler (based on spark.ui.custom.executor.log.url configuration property) when created.

                                                                                                                                                                                                    DriverEndpoint uses the ExecutorLogUrlHandler to create an ExecutorData when requested to handle a RegisterExecutor message.
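
For example, a custom log URL pattern could be configured as below; the URL is made up and the exact set of {{...}} pattern variables depends on the cluster manager.

import org.apache.spark.SparkConf

// Hypothetical log-server URL; the handler substitutes the {{...}} placeholders per executor.
val conf = new SparkConf()
  .set("spark.ui.custom.executor.log.url",
    "http://my-log-server/logs/{{APP_ID}}/{{EXECUTOR_ID}}/{{FILE_NAME}}")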

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#onStart","title":"Starting DriverEndpoint RpcEndpoint
                                                                                                                                                                                                    onStart(): Unit\n

                                                                                                                                                                                                    onStart is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                    onStart requests the Revive Messages Scheduler Service to schedule a periodic action that sends ReviveOffers messages every revive interval (based on spark.scheduler.revive.interval configuration property).
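
A minimal sketch of the idea, using a plain ScheduledExecutorService instead of Spark's scheduler service (the interval value is made up):

import java.util.concurrent.{Executors, TimeUnit}

val reviveIntervalMs = 1000L  // stands in for spark.scheduler.revive.interval
val reviveThread = Executors.newSingleThreadScheduledExecutor()

// Periodically make the endpoint send ReviveOffers to itself (placeholder action here).
reviveThread.scheduleAtFixedRate(
  new Runnable { override def run(): Unit = println("self.send(ReviveOffers)") },
  reviveIntervalMs, reviveIntervalMs, TimeUnit.MILLISECONDS)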

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#makeOffers","title":"Launching Tasks

There are two makeOffers methods to launch tasks. They differ in how many active executors (from the executorDataMap registry) they work with:

                                                                                                                                                                                                    • All Active Executors
                                                                                                                                                                                                    • Single Executor
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#on-all-active-executors","title":"On All Active Executors","text":"
                                                                                                                                                                                                    makeOffers(): Unit\n

                                                                                                                                                                                                    makeOffers builds WorkerOffers for every active executor (in the executorDataMap registry) and requests the TaskSchedulerImpl to generate tasks for the available worker offers (that creates TaskDescriptions).

                                                                                                                                                                                                    With tasks (TaskDescriptions) to be launched, makeOffers launches them.

                                                                                                                                                                                                    makeOffers is used when:

                                                                                                                                                                                                    • DriverEndpoint handles ReviveOffers messages
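
A simplified sketch of that flow (ExecutorData and WorkerOffer are trimmed to the fields used here, and TaskDescriptions are reduced to plain strings):

final case class ExecutorData(host: String, freeCores: Int)           // simplified
final case class WorkerOffer(executorId: String, host: String, cores: Int)

def makeOffers(
    executorDataMap: Map[String, ExecutorData],
    resourceOffers: Seq[WorkerOffer] => Seq[String]): Unit = {        // TaskSchedulerImpl stand-in
  // One WorkerOffer per active executor, sized by its currently free cores.
  val workOffers = executorDataMap.toSeq.map { case (id, data) =>
    WorkerOffer(id, data.host, data.freeCores)
  }
  val taskDescriptions = resourceOffers(workOffers)
  if (taskDescriptions.nonEmpty) println(s"launchTasks: $taskDescriptions")
}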
                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/#on-single-executor","title":"On Single Executor","text":"
                                                                                                                                                                                                    makeOffers(\n  executorId: String): Unit\n

                                                                                                                                                                                                    Note

makeOffers with a single executor does the same as makeOffers for all active executors, but restricted to just the one given executor.

                                                                                                                                                                                                    makeOffers is used when:

                                                                                                                                                                                                    • DriverEndpoint handles StatusUpdate and LaunchedExecutor messages
                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/#launchTasks","title":"Launching Tasks","text":"
                                                                                                                                                                                                    launchTasks(\n  tasks: Seq[Seq[TaskDescription]]): Unit\n

                                                                                                                                                                                                    Note

The input tasks collection contains one or more TaskDescriptions per executor (and the "task partitioning" per executor is of no use in launchTasks, so it simply flattens the input data structure).

                                                                                                                                                                                                    For every TaskDescription (in the given tasks collection), launchTasks encodes it and makes sure that the encoded task size is below the allowed message size.

                                                                                                                                                                                                    launchTasks looks up the ExecutorData of the executor that has been assigned to execute the task (in executorDataMap internal registry) and decreases the executor's free cores (based on spark.task.cpus configuration property).

                                                                                                                                                                                                    Note

Scheduling in Spark relies on cores only (not memory), i.e. the number of tasks Spark can run concurrently on an executor is limited solely by the number of available cores. When submitting a Spark application for execution, both executor resources (memory and cores) can still be specified explicitly; it is the job of the cluster manager to monitor memory use and take action when it exceeds what was assigned.

                                                                                                                                                                                                    launchTasks prints out the following DEBUG message to the logs:

Launching task [taskId] on executor id: [executorId] hostname: [executorHost].

In the end, launchTasks sends the (serialized) task to the executor (by sending a LaunchTask message to the executor's RPC endpoint with the serialized task inside a SerializableBuffer).

                                                                                                                                                                                                    Note

                                                                                                                                                                                                    This is the moment in a task's lifecycle when the driver sends the serialized task to an assigned executor.

                                                                                                                                                                                                    "},{"location":"scheduler/DriverEndpoint/#task-exceeds-allowed-size","title":"Task Exceeds Allowed Size

                                                                                                                                                                                                    In case the size of a serialized TaskDescription equals or exceeds the maximum allowed RPC message size, launchTasks looks up the TaskSetManager for the TaskDescription (in taskIdToTaskSetManager registry) and aborts it with the following message:

Serialized task [id]:[index] was [limit] bytes, which exceeds max allowed: spark.rpc.message.maxSize ([maxRpcMessageSize] bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
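
Putting the launchTasks steps above together, a rough, self-contained sketch: TaskDescription, ExecutorDataSketch and the RPC send are simplified stand-ins, while maxRpcMessageSize and cpusPerTask stand for spark.rpc.message.maxSize and spark.task.cpus.

import java.nio.ByteBuffer
import scala.collection.mutable

final case class TaskDescription(taskId: Long, executorId: String, name: String)
final class ExecutorDataSketch(var freeCores: Int) {
  def send(msg: Any): Unit = println(s"-> executor RPC endpoint: $msg")
}

def launchTasks(
    tasks: Seq[Seq[TaskDescription]],
    executorDataMap: mutable.Map[String, ExecutorDataSketch],
    serialize: TaskDescription => ByteBuffer,
    maxRpcMessageSize: Int,
    cpusPerTask: Int): Unit = {
  for (task <- tasks.flatten) {                      // per-executor grouping is irrelevant here
    val serialized = serialize(task)
    if (serialized.limit() >= maxRpcMessageSize) {
      println(s"abort TaskSetManager of task ${task.taskId}: serialized task too large")
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= cpusPerTask          // cores taken by the task
      executorData.send(("LaunchTask", serialized))  // serialized task wrapped for the wire
    }
  }
}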
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#messages","title":"Messages","text":""},{"location":"scheduler/DriverEndpoint/#killexecutorsonhost","title":"KillExecutorsOnHost

                                                                                                                                                                                                    CoarseGrainedSchedulerBackend is requested to kill all executors on a node

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#killtask","title":"KillTask

                                                                                                                                                                                                    CoarseGrainedSchedulerBackend is requested to kill a task.

KillTask(
  taskId: Long,
  executor: String,
  interruptThread: Boolean)

                                                                                                                                                                                                    KillTask is sent when CoarseGrainedSchedulerBackend kills a task.

When KillTask is received, DriverEndpoint finds the executor (in the executorDataMap registry).

                                                                                                                                                                                                    If found, DriverEndpoint passes the message on to the executor (using its registered RPC endpoint for CoarseGrainedExecutorBackend).

Otherwise, you should see the following WARN message in the logs:

Attempted to kill task [taskId] for unknown executor [executor].
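
A simplified sketch of that forwarding logic (ExecutorEndpoint is a stand-in for the registered CoarseGrainedExecutorBackend RPC endpoint):

final case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
final case class ExecutorEndpoint(executorId: String) {
  def send(msg: Any): Unit = println(s"-> $executorId: $msg")
}

def handleKillTask(msg: KillTask, executorDataMap: Map[String, ExecutorEndpoint]): Unit =
  executorDataMap.get(msg.executor) match {
    case Some(endpoint) => endpoint.send(msg)  // pass the message on to the executor
    case None =>
      println(s"Attempted to kill task ${msg.taskId} for unknown executor ${msg.executor}.")
  }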
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#launchedexecutor","title":"LaunchedExecutor","text":""},{"location":"scheduler/DriverEndpoint/#registerexecutor","title":"RegisterExecutor

                                                                                                                                                                                                    CoarseGrainedExecutorBackend registers with the driver

RegisterExecutor(
  executorId: String,
  executorRef: RpcEndpointRef,
  hostname: String,
  cores: Int,
  logUrls: Map[String, String])

                                                                                                                                                                                                    RegisterExecutor is sent when CoarseGrainedExecutorBackend RPC Endpoint is requested to start.

                                                                                                                                                                                                    When received, DriverEndpoint makes sure that no other executors were registered under the input executorId and that the input hostname is not blacklisted.

                                                                                                                                                                                                    If the requirements hold, you should see the following INFO message in the logs:

Registered executor [executorRef] ([address]) with ID [executorId]

                                                                                                                                                                                                    DriverEndpoint does the bookkeeping:

                                                                                                                                                                                                    • Registers executorId (in addressToExecutorId)
                                                                                                                                                                                                    • Adds cores (in totalCoreCount)
                                                                                                                                                                                                    • Increments totalRegisteredExecutors
                                                                                                                                                                                                    • Creates and registers ExecutorData for executorId (in executorDataMap)
                                                                                                                                                                                                    • Updates currentExecutorIdCounter if the input executorId is greater than the current value.

If numPendingExecutors is greater than 0, DriverEndpoint decrements it and you should see the following DEBUG message in the logs:

Decremented number of pending executors ([numPendingExecutors] left)

                                                                                                                                                                                                    DriverEndpoint sends RegisteredExecutor message back (that is to confirm that the executor was registered successfully).

                                                                                                                                                                                                    DriverEndpoint replies true (to acknowledge the message).

                                                                                                                                                                                                    DriverEndpoint then announces the new executor by posting SparkListenerExecutorAdded to LiveListenerBus.

                                                                                                                                                                                                    In the end, DriverEndpoint makes executor resource offers (for launching tasks).

                                                                                                                                                                                                    If however there was already another executor registered under the input executorId, DriverEndpoint sends RegisterExecutorFailed message back with the reason:

Duplicate executor ID: [executorId]

                                                                                                                                                                                                    If however the input hostname is blacklisted, you should see the following INFO message in the logs:

Rejecting [executorId] as it has been blacklisted.

                                                                                                                                                                                                    DriverEndpoint sends RegisterExecutorFailed message back with the reason:

Executor is blacklisted: [executorId]
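
Putting the registration bookkeeping above together, a rough sketch with simplified registries (RPC replies, the listener bus and the blacklist check are omitted; RpcAddress is reduced to a String):

import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

final case class ExecutorInfo(host: String, totalCores: Int, var freeCores: Int)

object RegistrationSketch {
  val executorDataMap = mutable.Map.empty[String, ExecutorInfo]
  val addressToExecutorId = mutable.Map.empty[String, String]
  val totalCoreCount = new AtomicInteger(0)
  val totalRegisteredExecutors = new AtomicInteger(0)

  def registerExecutor(
      executorId: String, address: String, host: String, cores: Int): Either[String, Unit] =
    if (executorDataMap.contains(executorId)) {
      Left(s"Duplicate executor ID: $executorId")    // would trigger RegisterExecutorFailed
    } else {
      addressToExecutorId(address) = executorId      // register the executor's RPC address
      totalCoreCount.addAndGet(cores)                // add its cores
      totalRegisteredExecutors.incrementAndGet()
      executorDataMap(executorId) = ExecutorInfo(host, cores, cores)
      Right(())                                      // then RegisteredExecutor is sent back
    }
}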
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#removeexecutor","title":"RemoveExecutor","text":""},{"location":"scheduler/DriverEndpoint/#removeworker","title":"RemoveWorker","text":""},{"location":"scheduler/DriverEndpoint/#retrievesparkappconfig","title":"RetrieveSparkAppConfig
                                                                                                                                                                                                    RetrieveSparkAppConfig(\n  resourceProfileId: Int)\n

                                                                                                                                                                                                    Posted when:

                                                                                                                                                                                                    • CoarseGrainedExecutorBackend standalone application is started

                                                                                                                                                                                                    When received, DriverEndpoint replies with a SparkAppConfig message with the following:

                                                                                                                                                                                                    1. spark-prefixed configuration properties
                                                                                                                                                                                                    2. IO Encryption Key
                                                                                                                                                                                                    3. Delegation tokens
                                                                                                                                                                                                    4. Default profile
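
A hypothetical simplification of that request/reply exchange; RpcCallContext is re-declared here as a stub (the real trait has the same reply method) and the SparkAppConfig fields are only approximate.

trait RpcCallContext { def reply(response: Any): Unit }

final case class SparkAppConfig(
    sparkProperties: Seq[(String, String)],
    ioEncryptionKey: Option[Array[Byte]],
    delegationTokens: Option[Array[Byte]])

def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case ("RetrieveSparkAppConfig", resourceProfileId: Int) =>
    println(s"building SparkAppConfig for resource profile $resourceProfileId")
    context.reply(SparkAppConfig(
      sparkProperties = Seq("spark.app.name" -> "demo"),  // spark-prefixed properties
      ioEncryptionKey = None,
      delegationTokens = None))
}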
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#reviveoffers","title":"ReviveOffers

                                                                                                                                                                                                    Posted when:

                                                                                                                                                                                                    • Periodically (every spark.scheduler.revive.interval) right after DriverEndpoint is requested to start
                                                                                                                                                                                                    • CoarseGrainedSchedulerBackend is requested to revive resource offers

                                                                                                                                                                                                    When received, DriverEndpoint makes executor resource offers.

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#statusupdate","title":"StatusUpdate

                                                                                                                                                                                                    CoarseGrainedExecutorBackend sends task status updates to the driver

StatusUpdate(
  executorId: String,
  taskId: Long,
  state: TaskState,
  data: SerializableBuffer)

                                                                                                                                                                                                    StatusUpdate is sent when CoarseGrainedExecutorBackend sends task status updates to the driver.

                                                                                                                                                                                                    When StatusUpdate is received, DriverEndpoint requests the TaskSchedulerImpl to handle the task status update.

                                                                                                                                                                                                    If the task has finished, DriverEndpoint updates the number of cores available for work on the corresponding executor (registered in executorDataMap).

                                                                                                                                                                                                    DriverEndpoint makes an executor resource offer on the single executor.

When DriverEndpoint finds no executor (in executorDataMap), you should see the following WARN message in the logs:

Ignored task status update ([taskId] state [state]) from unknown executor with ID [executorId]
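
A simplified sketch of that handling (the task state is reduced to a finished flag, and cpusPerTask stands for spark.task.cpus):

import scala.collection.mutable

final case class StatusUpdate(executorId: String, taskId: Long, finished: Boolean)
final case class ExecutorSlot(var freeCores: Int)

def handleStatusUpdate(
    update: StatusUpdate,
    executorDataMap: mutable.Map[String, ExecutorSlot],
    cpusPerTask: Int,
    makeOffersOn: String => Unit): Unit = {
  // (In Spark, TaskSchedulerImpl is asked to handle the status update first.)
  if (update.finished) {
    executorDataMap.get(update.executorId) match {
      case Some(slot) =>
        slot.freeCores += cpusPerTask      // the task's cores become available again
        makeOffersOn(update.executorId)    // make a resource offer on this single executor
      case None =>
        println(s"Ignored task status update (${update.taskId}) " +
          s"from unknown executor with ID ${update.executorId}")
    }
  }
}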
                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#stopdriver","title":"StopDriver","text":""},{"location":"scheduler/DriverEndpoint/#stopexecutors","title":"StopExecutors

                                                                                                                                                                                                    StopExecutors message is receive-reply and blocking. When received, the following INFO message appears in the logs:

Asking each executor to shut down

                                                                                                                                                                                                    It then sends a StopExecutor message to every registered executor (from executorDataMap).

                                                                                                                                                                                                    ","text":""},{"location":"scheduler/DriverEndpoint/#updatedelegationtokens","title":"UpdateDelegationTokens","text":""},{"location":"scheduler/DriverEndpoint/#removing-executor","title":"Removing Executor
                                                                                                                                                                                                    removeExecutor(\n  executorId: String,\n  reason: ExecutorLossReason): Unit\n

                                                                                                                                                                                                    When removeExecutor is executed, you should see the following DEBUG message in the logs:

Asked to remove executor [executorId] with reason [reason]

                                                                                                                                                                                                    removeExecutor then tries to find the executorId executor (in executorDataMap internal registry).

                                                                                                                                                                                                    If the executorId executor was found, removeExecutor removes the executor from the following registries:

                                                                                                                                                                                                    • addressToExecutorId
                                                                                                                                                                                                    • executorDataMap
                                                                                                                                                                                                    • <>
                                                                                                                                                                                                    • executorsPendingToRemove
                                                                                                                                                                                                    • removeExecutor decrements:

                                                                                                                                                                                                      • totalCoreCount by the executor's totalCores
                                                                                                                                                                                                      • totalRegisteredExecutors

                                                                                                                                                                                                      In the end, removeExecutor notifies TaskSchedulerImpl that an executor was lost.

                                                                                                                                                                                                      removeExecutor posts SparkListenerExecutorRemoved to LiveListenerBus (with the executorId executor).

                                                                                                                                                                                                      If however the executorId executor could not be found, removeExecutor requests BlockManagerMaster to remove the executor asynchronously.

                                                                                                                                                                                                      Note

                                                                                                                                                                                                      removeExecutor uses SparkEnv to access the current BlockManager and then BlockManagerMaster.

                                                                                                                                                                                                      You should see the following INFO message in the logs:

Asked to remove non-existent executor [executorId]

removeExecutor is used when DriverEndpoint handles a RemoveExecutor message and when it gets disassociated from a remote RPC endpoint of an executor.
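
A rough sketch of the inverse of the registration bookkeeping (simplified registries; the TaskSchedulerImpl, BlockManagerMaster and listener-bus notifications are omitted):

import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

final case class ExecutorRecord(address: String, totalCores: Int)

def removeExecutor(
    executorId: String,
    executorDataMap: mutable.Map[String, ExecutorRecord],
    addressToExecutorId: mutable.Map[String, String],
    totalCoreCount: AtomicInteger,
    totalRegisteredExecutors: AtomicInteger): Unit =
  executorDataMap.remove(executorId) match {
    case Some(data) =>
      addressToExecutorId.remove(data.address)       // drop the executor's RPC address
      totalCoreCount.addAndGet(-data.totalCores)     // give back its cores
      totalRegisteredExecutors.decrementAndGet()
    case None =>
      println(s"Asked to remove non-existent executor $executorId")
  }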

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#removing-worker","title":"Removing Worker
                                                                                                                                                                                                      removeWorker(\n  workerId: String,\n  host: String,\n  message: String): Unit\n

                                                                                                                                                                                                      removeWorker prints out the following DEBUG message to the logs:

Asked to remove worker [workerId] with reason [message]

                                                                                                                                                                                                      In the end, removeWorker simply requests the TaskSchedulerImpl to workerRemoved.

                                                                                                                                                                                                      removeWorker is used when DriverEndpoint is requested to handle a RemoveWorker event.

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#processing-one-way-messages","title":"Processing One-Way Messages
                                                                                                                                                                                                      receive: PartialFunction[Any, Unit]\n

                                                                                                                                                                                                      receive is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                      receive...FIXME

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#processing-two-way-messages","title":"Processing Two-Way Messages
                                                                                                                                                                                                      receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                                                                                                                                                                                      receiveAndReply is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                      receiveAndReply...FIXME

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#ondisconnected-callback","title":"onDisconnected Callback

onDisconnected looks up the executor registered at the disconnected RPC address (in the internal addressToExecutorId registry) and, if found, removes it, which effectively removes the executor from the cluster.

The executor is removed with the reason SlaveLost and the following message:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.\n
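The flow can be pictured with a small, self-contained sketch. The RpcAddress and SlaveLost case classes below are simplified stand-ins, not Spark's actual (private) classes:

import scala.collection.mutable

// Simplified stand-ins for Spark's internal types (for illustration only).
case class RpcAddress(host: String, port: Int)
case class SlaveLost(message: String)

object OnDisconnectedSketch {
  // executor IDs keyed by the RPC address they registered from
  val addressToExecutorId = mutable.Map.empty[RpcAddress, String]

  def removeExecutor(executorId: String, reason: SlaveLost): Unit =
    println(s"Removing executor $executorId: ${reason.message}")

  // onDisconnected boils down to: find the executor registered at the
  // disconnected address and remove it with a SlaveLost reason.
  def onDisconnected(remoteAddress: RpcAddress): Unit =
    addressToExecutorId.get(remoteAddress).foreach { executorId =>
      removeExecutor(executorId, SlaveLost(
        "Remote RPC client disassociated. Likely due to containers exceeding " +
          "thresholds, or network issues. Check driver logs for WARN messages."))
    }
}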
                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#executors-by-rpcaddress-registry","title":"Executors by RpcAddress Registry
                                                                                                                                                                                                      addressToExecutorId: Map[RpcAddress, String]\n

Lookup table of executor IDs by the RPC address (host and port) they registered from.

A new entry is added when an executor connects to register itself.

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#disabling-executor","title":"Disabling Executor
                                                                                                                                                                                                      disableExecutor(\n  executorId: String): Boolean\n

                                                                                                                                                                                                      disableExecutor checks whether the executor is active:

                                                                                                                                                                                                      • If so, disableExecutor adds the executor to the executorsPendingLossReason registry
• Otherwise, disableExecutor checks whether the executor is already registered in the executorsPendingToRemove registry

disableExecutor then determines whether the executor should really be disabled, i.e. whether it is active or registered in the executorsPendingToRemove registry.

                                                                                                                                                                                                      If the executor should be disabled, disableExecutor prints out the following INFO message to the logs and notifies the TaskSchedulerImpl that the executor is lost.

                                                                                                                                                                                                      Disabling executor [executorId].\n

disableExecutor returns whether the executor was disabled or not.

                                                                                                                                                                                                      disableExecutor is used when:

                                                                                                                                                                                                      • KubernetesDriverEndpoint is requested to handle onDisconnected event
                                                                                                                                                                                                      • YarnDriverEndpoint is requested to handle onDisconnected event
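A minimal sketch of the decision described above, using plain Scala collections as stand-ins for DriverEndpoint's private registries:

import scala.collection.mutable

object DisableExecutorSketch {
  // stand-ins for DriverEndpoint's internal registries
  val executorDataMap = mutable.Map.empty[String, AnyRef]        // "active" executors
  val executorsPendingLossReason = mutable.Set.empty[String]
  val executorsPendingToRemove = mutable.Map.empty[String, Boolean]

  // placeholder for "notify the TaskSchedulerImpl that the executor is lost"
  def notifySchedulerOfExecutorLoss(executorId: String): Unit =
    println(s"Disabling executor $executorId.")

  def disableExecutor(executorId: String): Boolean = {
    val shouldDisable =
      if (executorDataMap.contains(executorId)) {
        // active executor: remember that a loss reason is still pending
        executorsPendingLossReason += executorId
        true
      } else {
        // not active: only treat it as disabled if it is already pending removal
        executorsPendingToRemove.contains(executorId)
      }
    if (shouldDisable) notifySchedulerOfExecutorLoss(executorId)
    shouldDisable
  }
}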
                                                                                                                                                                                                      ","text":""},{"location":"scheduler/DriverEndpoint/#logging","title":"Logging

                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint logger to see what happens inside.

                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                      log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint=ALL\n

                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                      ","text":""},{"location":"scheduler/ExecutorData/","title":"ExecutorData","text":"

ExecutorData is the metadata of an executor:

                                                                                                                                                                                                      • Executor's RPC Endpoint
                                                                                                                                                                                                      • Executor's RpcAddress
                                                                                                                                                                                                      • Executor's Host
                                                                                                                                                                                                      • Executor's Free Cores
                                                                                                                                                                                                      • Executor's Total Cores
                                                                                                                                                                                                      • Executor's Log URLs (Map[String, String])
                                                                                                                                                                                                      • Executor's Attributes (Map[String, String])
                                                                                                                                                                                                      • Executor's Resources Info (Map[String, ExecutorResourceInfo])
                                                                                                                                                                                                      • Executor's ResourceProfile ID

                                                                                                                                                                                                        ExecutorData is created for every executor registered (when DriverEndpoint is requested to handle a RegisterExecutor message).

                                                                                                                                                                                                        ExecutorData is used by CoarseGrainedSchedulerBackend to track registered executors.

                                                                                                                                                                                                        Note

                                                                                                                                                                                                        ExecutorData is posted as part of SparkListenerExecutorAdded event by DriverEndpoint every time an executor is registered.
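Conceptually, ExecutorData is a record of the fields listed above. A rough, hypothetical sketch (the real class is private to Spark, keeps an RpcEndpointRef and RpcAddress rather than strings, and mutates the free-core count as tasks are scheduled):

// Hypothetical shape of the metadata; endpoint and address are simplified to
// strings here, and resources info to a plain map.
case class ExecutorDataSketch(
    executorEndpoint: String,
    executorAddress: String,
    executorHost: String,
    freeCores: Int,
    totalCores: Int,
    logUrlMap: Map[String, String],
    attributes: Map[String, String],
    resourcesInfo: Map[String, String],
    resourceProfileId: Int)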

                                                                                                                                                                                                        "},{"location":"scheduler/ExternalClusterManager/","title":"ExternalClusterManager","text":"

                                                                                                                                                                                                        ExternalClusterManager is an abstraction of pluggable cluster managers that can create a SchedulerBackend and TaskScheduler for a given master URL (when SparkContext is created).

                                                                                                                                                                                                        Note

                                                                                                                                                                                                        The support for pluggable cluster managers was introduced in SPARK-13904 Add support for pluggable cluster manager.

                                                                                                                                                                                                        ExternalClusterManager can be registered using the java.util.ServiceLoader mechanism (with service markers under META-INF/services directory).
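A generic sketch of how the java.util.ServiceLoader discovery works (ExternalClusterManager itself is private[spark], so the snippet uses a type parameter; the service file name is the fully-qualified interface name):

import java.util.ServiceLoader
import scala.collection.mutable.ListBuffer

// Discovers every implementation of serviceClass declared in a
// META-INF/services/<fully.qualified.InterfaceName> file on the classpath.
def loadImplementations[T](serviceClass: Class[T]): List[T] = {
  val loader = Thread.currentThread().getContextClassLoader
  val it = ServiceLoader.load(serviceClass, loader).iterator()
  val found = ListBuffer.empty[T]
  while (it.hasNext) found += it.next()
  found.toList
}

SparkContext then asks each discovered cluster manager whether it canCreate scheduler components for the given master URL and expects exactly one match.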

                                                                                                                                                                                                        "},{"location":"scheduler/ExternalClusterManager/#contract","title":"Contract","text":""},{"location":"scheduler/ExternalClusterManager/#checking-support-for-master-url","title":"Checking Support for Master URL
                                                                                                                                                                                                        canCreate(\n  masterURL: String): Boolean\n

                                                                                                                                                                                                        Checks whether this cluster manager instance can create scheduler components for a given master URL

                                                                                                                                                                                                        Used when SparkContext is created (and requested for a cluster manager)

                                                                                                                                                                                                        ","text":""},{"location":"scheduler/ExternalClusterManager/#creating-schedulerbackend","title":"Creating SchedulerBackend
                                                                                                                                                                                                        createSchedulerBackend(\n  sc: SparkContext,\n  masterURL: String,\n  scheduler: TaskScheduler): SchedulerBackend\n

                                                                                                                                                                                                        Creates a SchedulerBackend for a given SparkContext, master URL, and TaskScheduler.

                                                                                                                                                                                                        Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)

                                                                                                                                                                                                        ","text":""},{"location":"scheduler/ExternalClusterManager/#creating-taskscheduler","title":"Creating TaskScheduler
                                                                                                                                                                                                        createTaskScheduler(\n  sc: SparkContext,\n  masterURL: String): TaskScheduler\n

                                                                                                                                                                                                        Creates a TaskScheduler for a given SparkContext and master URL

                                                                                                                                                                                                        Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)

                                                                                                                                                                                                        ","text":""},{"location":"scheduler/ExternalClusterManager/#initializing-scheduling-components","title":"Initializing Scheduling Components
                                                                                                                                                                                                        initialize(\n  scheduler: TaskScheduler,\n  backend: SchedulerBackend): Unit\n

                                                                                                                                                                                                        Initializes the TaskScheduler and SchedulerBackend

                                                                                                                                                                                                        Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)
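Putting the contract together, a skeletal (hypothetical) implementation could look as follows. MySchedulerBackend and the mycluster:// URL scheme are made up for illustration, and because the trait is private[spark], a real implementation has to live under the org.apache.spark package (as the Kubernetes, Mesos and YARN cluster managers do):

package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// Registered for discovery via a
// META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file
// containing this class's fully-qualified name.
class MyClusterManager extends ExternalClusterManager {

  // claim only the (made-up) mycluster:// master URLs
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend =
    // MySchedulerBackend is a placeholder SchedulerBackend you would implement
    new MySchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc, masterURL)

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}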

                                                                                                                                                                                                        ","text":""},{"location":"scheduler/ExternalClusterManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                        • KubernetesClusterManager (Spark on Kubernetes)
                                                                                                                                                                                                        • MesosClusterManager
                                                                                                                                                                                                        • YarnClusterManager
                                                                                                                                                                                                        "},{"location":"scheduler/FIFOSchedulableBuilder/","title":"FIFOSchedulableBuilder","text":"

FIFOSchedulableBuilder is a SchedulableBuilder that holds a single Pool (given when FIFOSchedulableBuilder is created).

Note

FIFOSchedulableBuilder is the default SchedulableBuilder of TaskSchedulerImpl.

Note

When FIFOSchedulableBuilder is created, TaskSchedulerImpl passes in its own rootPool (a part of the TaskScheduler contract).

FIFOSchedulableBuilder obeys the SchedulableBuilder contract as follows (see the sketch below):

• buildPools does nothing.
• addTaskSetManager passes the input Schedulable to the one and only rootPool Pool (using addSchedulable) and completely disregards the properties of the Schedulable.

"},{"location":"scheduler/FIFOSchedulableBuilder/#creating-instance","title":"Creating Instance","text":"

FIFOSchedulableBuilder takes the following to be created:

• rootPool Pool
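Based on the description above, the builder is essentially a pass-through to the root pool. A sketch (Pool, Schedulable and SchedulableBuilder are Spark's internal scheduler types):

import java.util.Properties

class FIFOSchedulableBuilderSketch(val rootPool: Pool) extends SchedulableBuilder {

  // nothing to build for FIFO: there is only the single root pool
  override def buildPools(): Unit = {}

  // every TaskSetManager goes straight into the root pool;
  // the properties (e.g. spark.scheduler.pool) are ignored
  override def addTaskSetManager(manager: Schedulable, properties: Properties): Unit =
    rootPool.addSchedulable(manager)
}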
                                                                                                                                                                                                          "},{"location":"scheduler/FairSchedulableBuilder/","title":"FairSchedulableBuilder","text":"

FairSchedulableBuilder is a SchedulableBuilder that is created exclusively for TaskSchedulerImpl for the FAIR scheduling mode (when the spark.scheduler.mode configuration property is FAIR).

FairSchedulableBuilder takes the following to be created:

• rootPool Pool
• SparkConf

Once created, TaskSchedulerImpl requests the FairSchedulableBuilder to build the pools (buildPools).

FairSchedulableBuilder uses the pools defined in an allocation pools configuration file, which is either the file given by the spark.scheduler.allocation.file configuration property or the default fairscheduler.xml (looked up on the CLASSPATH).

Tip

Use conf/fairscheduler.xml.template as a template for the allocation pools configuration file.

FairSchedulableBuilder always has the default pool defined (and registers it unless it is already defined in the allocation pools configuration file).

FairSchedulableBuilder uses the spark.scheduler.pool local property for the name of the pool to use when requested to addTaskSetManager (default: default).

Note

SparkContext.setLocalProperty lets you set local properties per thread to group jobs in logical groups, e.g. to allow FairSchedulableBuilder to use the spark.scheduler.pool property and to group jobs from different threads to be submitted for execution on a non-default pool.

scala> :type sc
org.apache.spark.SparkContext

sc.setLocalProperty(\"spark.scheduler.pool\", \"production\")

// whatever is executed afterwards is submitted to the production pool

","text":""},{"location":"scheduler/FairSchedulableBuilder/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.scheduler.FairSchedulableBuilder logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.FairSchedulableBuilder=ALL\n

Refer to Logging.

","text":""},{"location":"scheduler/FairSchedulableBuilder/#allocation-pools-configuration-file","title":"Allocation Pools Configuration File

The allocation pools configuration file is an XML file.

The default conf/fairscheduler.xml.template defines two pools: one with the FAIR scheduling mode (weight 1, minShare 2) and one with the FIFO scheduling mode (weight 2, minShare 3).

Tip

The top-level element's name (allocations here) can be anything. Spark does not insist on allocations and accepts any name.
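For reference, a reconstruction of what the template looks like (the production and test pool names are the ones the stock template ships with; treat this as a sketch and adapt the pools to your workload):

<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>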

","text":""},{"location":"scheduler/FairSchedulableBuilder/#building-tree-of-pools-of-schedulables","title":"Building (Tree of) Pools of Schedulables

buildPools(): Unit\n

buildPools is part of the SchedulableBuilder abstraction to build a tree of pools (of Schedulables).

buildPools builds the pools defined in the allocation pools configuration file, if available (buildFairSchedulerPool), and then registers the default pool (buildDefaultPool).

buildPools prints out the following INFO message to the logs when the configuration file (per the spark.scheduler.allocation.file configuration property) could be read:

                                                                                                                                                                                                            Creating Fair Scheduler pools from [file]\n

buildPools prints out the following INFO message to the logs when the spark.scheduler.allocation.file configuration property was not used to define the configuration file and the default configuration file is used instead:

                                                                                                                                                                                                            Creating Fair Scheduler pools from default file: [DEFAULT_SCHEDULER_FILE]\n

When neither the spark.scheduler.allocation.file configuration property nor the default configuration file could be used, buildPools prints out the following WARN message to the logs:

                                                                                                                                                                                                            Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in [DEFAULT_SCHEDULER_FILE] or set spark.scheduler.allocation.file to a file that contains the configuration.\n

","text":""},{"location":"scheduler/FairSchedulableBuilder/#addtasksetmanager","title":"addTaskSetManager

addTaskSetManager(\n  manager: Schedulable,\n  properties: Properties): Unit\n

addTaskSetManager is part of the SchedulableBuilder abstraction to register a new Schedulable with the rootPool.

addTaskSetManager finds the pool by name (in the given Properties) under the spark.scheduler.pool property, or defaults to the default pool if undefined.

addTaskSetManager then requests the rootPool to find the pool by that name (getSchedulableByName).

Unless found, addTaskSetManager creates a new Pool with the default configuration (as if the default pool were used) and requests the rootPool to register it (addSchedulable). In the end, addTaskSetManager prints out the following WARN message to the logs:

A job was submitted with scheduler pool [poolName], which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain [poolName]. Created [poolName] with default configuration (schedulingMode: [mode], minShare: [minShare], weight: [weight])\n

addTaskSetManager then requests the pool (found or newly created) to register the given Schedulable (addSchedulable).

                                                                                                                                                                                                            In the end, addTaskSetManager prints out the following INFO message to the logs:

                                                                                                                                                                                                            Added task set [name] tasks to pool [poolName]\n
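A sketch of the lookup-or-create flow described above (Pool, Schedulable and SchedulingMode are Spark's internal scheduler types; the defaults mirror the text: FIFO, minShare 0, weight 1):

import java.util.Properties

def addTaskSetManagerSketch(rootPool: Pool, manager: Schedulable, properties: Properties): Unit = {
  // pool name from the spark.scheduler.pool local property, or "default"
  val poolName = Option(properties)
    .map(_.getProperty("spark.scheduler.pool", "default"))
    .getOrElse("default")

  val pool = Option(rootPool.getSchedulableByName(poolName)).getOrElse {
    // the pool was not configured in the allocations file: create it on the fly
    // with the default configuration and register it with the root pool
    val created = new Pool(poolName, SchedulingMode.FIFO, 0, 1)
    rootPool.addSchedulable(created)
    created
  }

  pool.addSchedulable(manager)
}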

","text":""},{"location":"scheduler/FairSchedulableBuilder/#registering-default-pool","title":"Registering Default Pool

buildDefaultPool(): Unit\n

buildDefaultPool requests the rootPool to find the pool with the default name (getSchedulableByName).

Unless already available, buildDefaultPool creates a Pool with the following:

• default pool name

• FIFO scheduling mode

• 0 for the initial minimum share

• 1 for the initial weight

In the end, buildDefaultPool requests the rootPool to register the new pool (addSchedulable), followed by the INFO message in the logs:

Created default pool: [name], schedulingMode: [mode], minShare: [minShare], weight: [weight]\n

buildDefaultPool is used when FairSchedulableBuilder is requested to build pools (buildPools).

","text":""},{"location":"scheduler/FairSchedulableBuilder/#building-pools-from-xml-allocations-file","title":"Building Pools from XML Allocations File

buildFairSchedulerPool(\n  is: InputStream,\n  fileName: String): Unit\n

                                                                                                                                                                                                              buildFairSchedulerPool starts by loading the XML file from the given InputStream.

For every pool element, buildFairSchedulerPool creates a Pool with the following:

                                                                                                                                                                                                              • Pool name per name attribute

                                                                                                                                                                                                              • Scheduling mode per schedulingMode element (case-insensitive with FIFO as the default)

                                                                                                                                                                                                              • Initial minimum share per minShare element (default: 0)

                                                                                                                                                                                                              • Initial weight per weight element (default: 1)

In the end, buildFairSchedulerPool requests the rootPool to register the new pool (addSchedulable), followed by the INFO message in the logs:

                                                                                                                                                                                                              Created pool: [name], schedulingMode: [mode], minShare: [minShare], weight: [weight]\n
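A sketch of the XML parsing part using scala-xml (available on Spark's classpath); it extracts the per-pool settings with the defaults listed above and leaves out the Pool construction and registration:

import java.io.InputStream
import scala.xml.XML

// Returns (pool name, scheduling mode, minShare, weight) for every <pool> element.
def parseAllocations(is: InputStream): Seq[(String, String, Int, Int)] = {
  val xml = XML.load(is)
  (xml \\ "pool").map { pool =>
    val name = (pool \ "@name").text
    val modeText = (pool \ "schedulingMode").text.trim.toUpperCase
    val mode = if (modeText.isEmpty) "FIFO" else modeText   // case-insensitive, FIFO default
    val minShareText = (pool \ "minShare").text.trim
    val minShare = if (minShareText.isEmpty) 0 else minShareText.toInt
    val weightText = (pool \ "weight").text.trim
    val weight = if (weightText.isEmpty) 1 else weightText.toInt
    (name, mode, minShare, weight)
  }
}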

buildFairSchedulerPool is used when FairSchedulableBuilder is requested to build pools (buildPools)."},{"location":"scheduler/HighlyCompressedMapStatus/","title":"HighlyCompressedMapStatus","text":"

HighlyCompressedMapStatus is a MapStatus that, instead of tracking the exact size of every shuffle block, stores the average size of the non-empty blocks and a bitmap of the empty ones. It is used for shuffles with a large number of partitions to keep map output statuses small.

                                                                                                                                                                                                              "},{"location":"scheduler/JobListener/","title":"JobListener","text":"

                                                                                                                                                                                                              JobListener is an abstraction of listeners that listen for job completion or failure events (after submitting a job to the DAGScheduler).

                                                                                                                                                                                                              "},{"location":"scheduler/JobListener/#contract","title":"Contract","text":""},{"location":"scheduler/JobListener/#tasksucceeded","title":"taskSucceeded
                                                                                                                                                                                                              taskSucceeded(\n  index: Int,\n  result: Any): Unit\n

                                                                                                                                                                                                              Used when DAGScheduler is requested to handleTaskCompletion or markMapStageJobAsFinished

                                                                                                                                                                                                              ","text":""},{"location":"scheduler/JobListener/#jobfailed","title":"jobFailed
                                                                                                                                                                                                              jobFailed(\n  exception: Exception): Unit\n

                                                                                                                                                                                                              Used when DAGScheduler is requested to cleanUpAfterSchedulerStop, handleJobSubmitted, handleMapStageSubmitted, handleTaskCompletion or failJobAndIndependentStages

                                                                                                                                                                                                              ","text":""},{"location":"scheduler/JobListener/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                              • ApproximateActionListener
                                                                                                                                                                                                              • JobWaiter
                                                                                                                                                                                                              "},{"location":"scheduler/JobWaiter/","title":"JobWaiter","text":"

JobWaiter is a JobListener that listens to task events and knows when all the tasks of a job have finished, successfully or not.

                                                                                                                                                                                                              "},{"location":"scheduler/JobWaiter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                              JobWaiter takes the following to be created:

                                                                                                                                                                                                              • DAGScheduler
                                                                                                                                                                                                              • Job ID
                                                                                                                                                                                                              • Total number of tasks
                                                                                                                                                                                                              • Result Handler Function ((Int, T) => Unit)

JobWaiter is created when DAGScheduler is requested to submit a job or a map stage.

                                                                                                                                                                                                                "},{"location":"scheduler/JobWaiter/#scala-promise","title":"Scala Promise
                                                                                                                                                                                                                jobPromise: Promise[Unit]\n

jobPromise is a Scala Promise that is completed successfully when all tasks finish successfully, or failed with an exception when the job fails.

                                                                                                                                                                                                                ","text":""},{"location":"scheduler/JobWaiter/#tasksucceeded","title":"taskSucceeded
                                                                                                                                                                                                                taskSucceeded(\n  index: Int,\n  result: Any): Unit\n

                                                                                                                                                                                                                taskSucceeded executes the Result Handler Function with the given index and result.

                                                                                                                                                                                                                taskSucceeded marks the waiter finished successfully when all tasks have finished.

taskSucceeded is part of the JobListener abstraction.

                                                                                                                                                                                                                ","text":""},{"location":"scheduler/JobWaiter/#jobfailed","title":"jobFailed
                                                                                                                                                                                                                jobFailed(\n  exception: Exception): Unit\n

                                                                                                                                                                                                                jobFailed marks the waiter failed.

jobFailed is part of the JobListener abstraction.
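A stripped-down illustration of the listener-plus-promise pattern described above (the real JobWaiter also synchronizes the result handler and supports job cancellation):

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Future, Promise}

class JobWaiterSketch[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private val finishedTasks = new AtomicInteger(0)

  // completed when all tasks succeed, failed on the first job failure
  private val jobPromise: Promise[Unit] =
    if (totalTasks == 0) Promise.successful[Unit](()) else Promise[Unit]()

  def completionFuture: Future[Unit] = jobPromise.future

  def taskSucceeded(index: Int, result: T): Unit = {
    resultHandler(index, result)
    if (finishedTasks.incrementAndGet() == totalTasks) {
      jobPromise.trySuccess(())
    }
  }

  def jobFailed(exception: Exception): Unit = {
    jobPromise.tryFailure(exception)
  }
}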

                                                                                                                                                                                                                ","text":""},{"location":"scheduler/LiveListenerBus/","title":"LiveListenerBus","text":"

                                                                                                                                                                                                                LiveListenerBus is an event bus to dispatch Spark events to registered SparkListeners.

LiveListenerBus lives in a single JVM and delivers events asynchronously to registered SparkListeners (every event queue polls events on its own dispatcher thread).

                                                                                                                                                                                                                Note

Each event queue is a java.util.concurrent.LinkedBlockingQueue with a default capacity of 10000 SparkListenerEvent events.

                                                                                                                                                                                                                "},{"location":"scheduler/LiveListenerBus/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                LiveListenerBus takes the following to be created:

                                                                                                                                                                                                                • SparkConf

                                                                                                                                                                                                                  LiveListenerBus is created (and started) when SparkContext is requested to initialize.
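For example, a custom SparkListener registered on a SparkContext ends up on the LiveListenerBus and is called asynchronously from its event queues (a usage sketch; the listener and what it logs are made up):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// given an active SparkContext called sc:
//   sc.addSparkListener(new JobLoggingListener)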

                                                                                                                                                                                                                  "},{"location":"scheduler/LiveListenerBus/#event-queues","title":"Event Queues
                                                                                                                                                                                                                  queues: CopyOnWriteArrayList[AsyncEventQueue]\n

                                                                                                                                                                                                                  LiveListenerBus manages AsyncEventQueues.

                                                                                                                                                                                                                  queues is initialized empty when LiveListenerBus is created.

                                                                                                                                                                                                                  queues is used when:

                                                                                                                                                                                                                  • Registering Listener with Queue
                                                                                                                                                                                                                  • Posting Event to All Queues
                                                                                                                                                                                                                  • Deregistering Listener
                                                                                                                                                                                                                  • Starting LiveListenerBus
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#livelistenerbusmetrics","title":"LiveListenerBusMetrics
                                                                                                                                                                                                                  metrics: LiveListenerBusMetrics\n

                                                                                                                                                                                                                  LiveListenerBus creates a LiveListenerBusMetrics when created.

                                                                                                                                                                                                                  metrics is registered (with a MetricsSystem) when LiveListenerBus is started.

                                                                                                                                                                                                                  metrics is used to:

• Increment the number of events posted (on every event posting)
• Create an AsyncEventQueue when adding a listener to a queue
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#starting-livelistenerbus","title":"Starting LiveListenerBus
                                                                                                                                                                                                                  start(\n  sc: SparkContext,\n  metricsSystem: MetricsSystem): Unit\n

                                                                                                                                                                                                                  start starts AsyncEventQueues (from the queues internal registry).

                                                                                                                                                                                                                  In the end, start requests the given MetricsSystem to register the LiveListenerBusMetrics.

                                                                                                                                                                                                                  start is used when:

                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#posting-event-to-all-queues","title":"Posting Event to All Queues
                                                                                                                                                                                                                  post(\n  event: SparkListenerEvent): Unit\n

post puts the input event onto the internal eventQueue queue and releases the internal eventLock semaphore. If the event could not be placed on the queue (which can happen since the queue is capped at 10000 events), the onDropEvent method is called.

Event publishing is only possible while the stopped flag is disabled (i.e. before LiveListenerBus is stopped).
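
The following is a hedged, simplified sketch of the post flow described above (the event type is reduced to String and the field names mirror the description; it is not the actual implementation):

import java.util.concurrent.{LinkedBlockingQueue, Semaphore}
import java.util.concurrent.atomic.AtomicBoolean

val eventQueue = new LinkedBlockingQueue[String](10000)
val eventLock  = new Semaphore(0)
val stopped    = new AtomicBoolean(false)

def onDropEvent(event: String): Unit =
  Console.err.println(s"Dropping $event: no remaining room in event queue")

def post(event: String): Unit = {
  if (stopped.get()) return                 // no publishing once the bus is stopped
  if (eventQueue.offer(event)) {
    eventLock.release()                     // wake up the polling listener thread
  } else {
    onDropEvent(event)                      // queue is capped at 10000 events
  }
}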

                                                                                                                                                                                                                  post is used when...FIXME

                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#posttoqueues","title":"postToQueues
                                                                                                                                                                                                                  postToQueues(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                  postToQueues...FIXME

                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#event-dropped-callback","title":"Event Dropped Callback
                                                                                                                                                                                                                  onDropEvent(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                  onDropEvent is called when no further events can be added to the internal eventQueue queue (while posting a SparkListenerEvent event).

It prints out the following ERROR message to the logs and ensures that the message is printed out only once.

                                                                                                                                                                                                                  Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.\n
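
A minimal sketch of one way to guarantee that the error is reported only once, using an AtomicBoolean guard (the field name logDroppedEvent is assumed here for illustration):

import java.util.concurrent.atomic.AtomicBoolean

val logDroppedEvent = new AtomicBoolean(false)

def onDropEvent(event: AnyRef): Unit = {
  if (logDroppedEvent.compareAndSet(false, true)) {
    // Printed at most once, no matter how many events are dropped afterwards.
    Console.err.println(
      "Dropping SparkListenerEvent because no remaining room in event queue. " +
      "This likely means one of the SparkListeners is too slow and cannot keep up with " +
      "the rate at which tasks are being started by the scheduler.")
  }
}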
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#stopping-livelistenerbus","title":"Stopping LiveListenerBus
                                                                                                                                                                                                                  stop(): Unit\n

stop releases the internal eventLock semaphore and waits until the listenerThread dies. That can only happen after all events have been posted (and polling eventQueue gives nothing).

In the end, stop enables the stopped flag.

                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#listenerthread-for-event-polling","title":"listenerThread for Event Polling

LiveListenerBus uses a single daemon thread (SparkListenerBus) that polls events off the event queue only after the listener bus has been started, and processes one event at a time.
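
A simplified sketch of such a polling daemon thread, reusing the queue, semaphore and stopped flag from the post sketch above (names and the shutdown protocol are illustrative assumptions, not the actual implementation):

import java.util.concurrent.{LinkedBlockingQueue, Semaphore}
import java.util.concurrent.atomic.AtomicBoolean

val eventQueue = new LinkedBlockingQueue[String](10000)
val eventLock  = new Semaphore(0)          // released once per posted event (and once on stop)
val stopped    = new AtomicBoolean(false)

val listenerThread = new Thread("spark-listener-bus") {
  setDaemon(true)
  override def run(): Unit = {
    while (true) {
      eventLock.acquire()                  // wait until there is something to do
      val event = eventQueue.poll()
      if (event == null) {
        // A null event together with the stopped flag signals shutdown.
        if (stopped.get()) return
      } else {
        println(s"Dispatching $event to listeners")   // one event at a time
      }
    }
  }
}
listenerThread.start()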

                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-status-queue","title":"Registering Listener with Status Queue
                                                                                                                                                                                                                  addToStatusQueue(\n  listener: SparkListenerInterface): Unit\n

addToStatusQueue adds the given SparkListenerInterface to the appStatus queue.

                                                                                                                                                                                                                  addToStatusQueue is used when:

                                                                                                                                                                                                                  • BarrierCoordinator is requested to onStart
                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                  • HiveThriftServer2 utility is used to createListenerAndUI
                                                                                                                                                                                                                  • SharedState (Spark SQL) is requested to create a SQLAppStatusStore
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-shared-queue","title":"Registering Listener with Shared Queue
                                                                                                                                                                                                                  addToSharedQueue(\n  listener: SparkListenerInterface): Unit\n

addToSharedQueue adds the given SparkListenerInterface to the shared queue.

                                                                                                                                                                                                                  addToSharedQueue is used when:

                                                                                                                                                                                                                  • SparkContext is requested to register a SparkListener and register extra SparkListeners
                                                                                                                                                                                                                  • ExecutionListenerBus (Spark Structured Streaming) is created
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-executormanagement-queue","title":"Registering Listener with executorManagement Queue
                                                                                                                                                                                                                  addToManagementQueue(\n  listener: SparkListenerInterface): Unit\n

addToManagementQueue adds the given SparkListenerInterface to the executorManagement queue.

                                                                                                                                                                                                                  addToManagementQueue is used when:

                                                                                                                                                                                                                  • ExecutorAllocationManager is requested to start
                                                                                                                                                                                                                  • HeartbeatReceiver is created
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-eventlog-queue","title":"Registering Listener with eventLog Queue
                                                                                                                                                                                                                  addToEventLogQueue(\n  listener: SparkListenerInterface): Unit\n

addToEventLogQueue adds the given SparkListenerInterface to the eventLog queue.

                                                                                                                                                                                                                  addToEventLogQueue is used when:

                                                                                                                                                                                                                  • SparkContext is created (with event logging enabled)
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-queue","title":"Registering Listener with Queue
                                                                                                                                                                                                                  addToQueue(\n  listener: SparkListenerInterface,\n  queue: String): Unit\n

                                                                                                                                                                                                                  addToQueue finds the queue in the queues internal registry.

If found, addToQueue requests it to add the given listener.

If not found, addToQueue creates an AsyncEventQueue (with the given name, the LiveListenerBusMetrics, and this LiveListenerBus) and requests it to add the given listener. The new AsyncEventQueue is then started and added to the queues internal registry.
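
A simplified, hedged sketch of the find-or-create logic (AsyncEventQueue is stubbed as a minimal class and the listener type is reduced to AnyRef for illustration):

import java.util.concurrent.CopyOnWriteArrayList
import scala.jdk.CollectionConverters._

// Minimal stand-in for the real AsyncEventQueue, for illustration only.
class StubAsyncEventQueue(val name: String) {
  private var listeners = List.empty[AnyRef]
  def addListener(l: AnyRef): Unit = listeners ::= l
  def start(): Unit = println(s"queue '$name' started")
}

val queues = new CopyOnWriteArrayList[StubAsyncEventQueue]

def addToQueue(listener: AnyRef, queue: String): Unit = {
  queues.asScala.find(_.name == queue) match {
    case Some(existing) =>
      existing.addListener(listener)       // queue already known: just add the listener
    case None =>
      val newQueue = new StubAsyncEventQueue(queue)
      newQueue.addListener(listener)
      newQueue.start()                     // started before being registered
      queues.add(newQueue)
  }
}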

                                                                                                                                                                                                                  addToQueue is used when:

                                                                                                                                                                                                                  • LiveListenerBus is requested to addToSharedQueue, addToManagementQueue, addToStatusQueue, addToEventLogQueue
                                                                                                                                                                                                                  • StreamingQueryListenerBus (Spark Structured Streaming) is created
                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputStatistics/","title":"MapOutputStatistics","text":"

                                                                                                                                                                                                                  MapOutputStatistics holds statistics about the output partition sizes in a map stage.

                                                                                                                                                                                                                  MapOutputStatistics is the result of executing the following (currently internal APIs):

                                                                                                                                                                                                                  • SparkContext is requested to submitMapStage
                                                                                                                                                                                                                  • DAGScheduler is requested to submitMapStage
                                                                                                                                                                                                                  "},{"location":"scheduler/MapOutputStatistics/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                  MapOutputStatistics takes the following to be created:

                                                                                                                                                                                                                  • Shuffle Id (of a ShuffleDependency)
                                                                                                                                                                                                                  • Output Partition Sizes (Array[Long])

                                                                                                                                                                                                                    MapOutputStatistics is created when:

                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested for the statistics (of a ShuffleDependency)
                                                                                                                                                                                                                    "},{"location":"scheduler/MapOutputTracker/","title":"MapOutputTracker","text":"

MapOutputTracker is the base abstraction of shuffle map output location registries.

                                                                                                                                                                                                                    "},{"location":"scheduler/MapOutputTracker/#contract","title":"Contract","text":""},{"location":"scheduler/MapOutputTracker/#getmapsizesbyexecutorid","title":"getMapSizesByExecutorId
                                                                                                                                                                                                                    getMapSizesByExecutorId(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]\n

                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                    • SortShuffleManager is requested for a ShuffleReader
                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/MapOutputTracker/#getmapsizesbyrange","title":"getMapSizesByRange
                                                                                                                                                                                                                    getMapSizesByRange(\n  shuffleId: Int,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]\n

                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                    • SortShuffleManager is requested for a ShuffleReader
                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/MapOutputTracker/#unregistershuffle","title":"unregisterShuffle
                                                                                                                                                                                                                    unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                    Deletes map output status information for the specified shuffle stage

                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                    • ContextCleaner is requested to doCleanupShuffle
                                                                                                                                                                                                                    • BlockManagerSlaveEndpoint is requested to handle a RemoveShuffle message
                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/MapOutputTracker/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                    • MapOutputTrackerMaster
                                                                                                                                                                                                                    • MapOutputTrackerWorker
                                                                                                                                                                                                                    "},{"location":"scheduler/MapOutputTracker/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                    MapOutputTracker takes the following to be created:

• SparkConf

  Abstract Class

  MapOutputTracker is an abstract class and cannot be created directly. It is created indirectly via the concrete MapOutputTrackers.

                                                                                                                                                                                                                      "},{"location":"scheduler/MapOutputTracker/#accessing-mapoutputtracker","title":"Accessing MapOutputTracker","text":"

                                                                                                                                                                                                                      MapOutputTracker is available using SparkEnv (on the driver and executors).

                                                                                                                                                                                                                      SparkEnv.get.mapOutputTracker\n
                                                                                                                                                                                                                      "},{"location":"scheduler/MapOutputTracker/#mapoutputtracker-rpc-endpoint","title":"MapOutputTracker RPC Endpoint

                                                                                                                                                                                                                      trackerEndpoint is a RpcEndpointRef of the MapOutputTracker RPC endpoint.

                                                                                                                                                                                                                      trackerEndpoint is initialized (registered or looked up) when SparkEnv is created for the driver and executors.

                                                                                                                                                                                                                      trackerEndpoint is used to communicate (synchronously).

                                                                                                                                                                                                                      trackerEndpoint is cleared (null) when MapOutputTrackerMaster is requested to stop.

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#deregistering-map-output-status-information-of-shuffle-stage","title":"Deregistering Map Output Status Information of Shuffle Stage
                                                                                                                                                                                                                      unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                      Deregisters map output status information for the given shuffle stage

                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                      • ContextCleaner is requested for shuffle cleanup

                                                                                                                                                                                                                      • BlockManagerSlaveEndpoint is requested to remove a shuffle

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#stopping-mapoutputtracker","title":"Stopping MapOutputTracker
                                                                                                                                                                                                                      stop(): Unit\n

                                                                                                                                                                                                                      stop does nothing at all.

                                                                                                                                                                                                                      stop is used when SparkEnv is requested to stop (and stops all the services, incl. MapOutputTracker).

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#converting-mapstatuses-to-blockmanagerids-with-shuffleblockids-and-their-sizes","title":"Converting MapStatuses To BlockManagerIds with ShuffleBlockIds and Their Sizes
                                                                                                                                                                                                                      convertMapStatuses(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int,\n  statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(BlockId, Long)])]\n

convertMapStatuses iterates over the input statuses array (of MapStatus entries indexed by map id) and, for every status and every partition between the input startPartition and endPartition, pairs the BlockManagerId of the MapStatus with a ShuffleBlockId (with the input shuffleId, the mapId, and the partition) and the estimated size of the reduce block.

                                                                                                                                                                                                                      For any empty MapStatus, convertMapStatuses prints out the following ERROR message to the logs:

                                                                                                                                                                                                                      Missing an output location for shuffle [id]\n

                                                                                                                                                                                                                      And convertMapStatuses throws a MetadataFetchFailedException (with shuffleId, startPartition, and the above error message).
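
A hedged sketch of the conversion, with MapStatus, BlockManagerId, and the exception reduced to minimal stand-ins (the real signatures and types differ):

// Stand-in types, for illustration only.
case class BlockManagerId(executorId: String, host: String, port: Int)
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int)
case class MapStatus(location: BlockManagerId, sizes: Array[Long])  // sizes indexed by reduce partition

def convertMapStatuses(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int,
    statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(ShuffleBlockId, Long)])] = {
  statuses.zipWithIndex.flatMap { case (status, mapId) =>
    if (status == null) {
      // Missing output: report and fail the fetch (the real code throws MetadataFetchFailedException).
      sys.error(s"Missing an output location for shuffle $shuffleId")
    } else {
      (startPartition until endPartition).map { part =>
        (status.location, (ShuffleBlockId(shuffleId, mapId, part), status.sizes(part)))
      }
    }
  }.groupBy(_._1)
   .map { case (bmId, blocks) => (bmId, blocks.map(_._2).toSeq) }
   .toSeq
}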

                                                                                                                                                                                                                      convertMapStatuses is used when:

                                                                                                                                                                                                                      • MapOutputTrackerMaster is requested for the sizes of shuffle map outputs by executor and range
• MapOutputTrackerWorker is requested for the sizes of shuffle map outputs by executor and range
                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#sending-blocking-messages-to-trackerendpoint-rpcendpointref","title":"Sending Blocking Messages To trackerEndpoint RpcEndpointRef
                                                                                                                                                                                                                      askTracker[T](message: Any): T\n

                                                                                                                                                                                                                      askTracker sends the input message to trackerEndpoint RpcEndpointRef and waits for a result.

                                                                                                                                                                                                                      When an exception happens, askTracker prints out the following ERROR message to the logs and throws a SparkException.

                                                                                                                                                                                                                      Error communicating with MapOutputTracker\n

askTracker is used when MapOutputTracker is requested to fetch map outputs for a ShuffleDependency remotely and to send a one-way message.
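
A hedged sketch of the ask-and-rethrow pattern (the RPC endpoint is stubbed with a plain function and SparkException with a local class; the real code goes through an RpcEndpointRef):

class SparkExceptionStub(message: String, cause: Throwable) extends Exception(message, cause)

// Stand-in for the RPC endpoint: answers any message synchronously (here it always fails).
def trackerEndpointAsk[T](message: Any): T =
  throw new RuntimeException("endpoint unavailable")

def askTracker[T](message: Any): T = {
  try {
    trackerEndpointAsk[T](message)
  } catch {
    case e: Exception =>
      Console.err.println("Error communicating with MapOutputTracker")
      throw new SparkExceptionStub("Error communicating with MapOutputTracker", e)
  }
}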

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#epoch","title":"Epoch

                                                                                                                                                                                                                      Starts from 0 when MapOutputTracker is created.

                                                                                                                                                                                                                      Can be updated (on MapOutputTrackerWorkers) or incremented (on the driver's MapOutputTrackerMaster).

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#sendtracker","title":"sendTracker
                                                                                                                                                                                                                      sendTracker(\n  message: Any): Unit\n

                                                                                                                                                                                                                      sendTracker...FIXME

                                                                                                                                                                                                                      sendTracker is used when:

                                                                                                                                                                                                                      • MapOutputTrackerMaster is requested to stop
                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#utilities","title":"Utilities","text":""},{"location":"scheduler/MapOutputTracker/#serializemapstatuses","title":"serializeMapStatuses
                                                                                                                                                                                                                      serializeMapStatuses(\n  statuses: Array[MapStatus],\n  broadcastManager: BroadcastManager,\n  isLocal: Boolean,\n  minBroadcastSize: Int,\n  conf: SparkConf): (Array[Byte], Broadcast[Array[Byte]])\n

                                                                                                                                                                                                                      serializeMapStatuses serializes the given array of map output locations into an efficient byte format (to send to reduce tasks). serializeMapStatuses compresses the serialized bytes using GZIP. They are supposed to be pretty compressible because many map outputs will be on the same hostname.

                                                                                                                                                                                                                      Internally, serializeMapStatuses creates a Java ByteArrayOutputStream.

                                                                                                                                                                                                                      serializeMapStatuses writes out 0 (direct) first.

                                                                                                                                                                                                                      serializeMapStatuses creates a Java GZIPOutputStream (with the ByteArrayOutputStream created) and writes out the given statuses array.

                                                                                                                                                                                                                      serializeMapStatuses decides whether to return the output array (of the output stream) or use a broadcast variable based on the size of the byte array.

                                                                                                                                                                                                                      If the size of the result byte array is the given minBroadcastSize threshold or bigger, serializeMapStatuses requests the input BroadcastManager to create a broadcast variable.

                                                                                                                                                                                                                      serializeMapStatuses resets the ByteArrayOutputStream and starts over.

                                                                                                                                                                                                                      serializeMapStatuses writes out 1 (broadcast) first.

                                                                                                                                                                                                                      serializeMapStatuses creates a new Java GZIPOutputStream (with the ByteArrayOutputStream created) and writes out the broadcast variable.

                                                                                                                                                                                                                      serializeMapStatuses prints out the following INFO message to the logs:

                                                                                                                                                                                                                      Broadcast mapstatuses size = [length], actual size = [length]\n
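
A hedged sketch of the direct-vs-broadcast decision (the broadcast step is stubbed out, map statuses are reduced to strings, and the exact wire format differs from the real implementation, which uses BroadcastManager and Spark's serializer):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

val DIRECT = 0
val BROADCAST = 1

// Serializes a payload into GZIP-compressed bytes, prefixed with a format marker byte.
def serialize(payload: AnyRef, marker: Int): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  bytes.write(marker)                       // 0 = direct, 1 = broadcast reference
  val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
  out.writeObject(payload)
  out.close()
  bytes.toByteArray
}

def serializeMapStatusesSketch(
    statuses: Array[String],                // stand-in for Array[MapStatus]
    minBroadcastSize: Int): Array[Byte] = {
  val direct = serialize(statuses, DIRECT)
  if (direct.length < minBroadcastSize) {
    direct                                  // small enough to send as-is
  } else {
    // Too big: broadcast the direct bytes and send only a (stubbed) broadcast reference.
    val broadcastRef = s"broadcast-of-${direct.length}-bytes"
    val result = serialize(broadcastRef, BROADCAST)
    println(s"Broadcast mapstatuses size = ${result.length}, actual size = ${direct.length}")
    result
  }
}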

                                                                                                                                                                                                                      serializeMapStatuses is used when ShuffleStatus is requested to serialize shuffle map output statuses.

                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTracker/#deserializemapstatuses","title":"deserializeMapStatuses
                                                                                                                                                                                                                      deserializeMapStatuses(\n  bytes: Array[Byte],\n  conf: SparkConf): Array[MapStatus]\n

                                                                                                                                                                                                                      deserializeMapStatuses...FIXME

                                                                                                                                                                                                                      deserializeMapStatuses is used when:

                                                                                                                                                                                                                      • MapOutputTrackerWorker is requested to getStatuses
                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/","title":"MapOutputTrackerMaster","text":"

                                                                                                                                                                                                                      MapOutputTrackerMaster is a MapOutputTracker for the driver.

                                                                                                                                                                                                                      MapOutputTrackerMaster is the source of truth of shuffle map output locations.

                                                                                                                                                                                                                      "},{"location":"scheduler/MapOutputTrackerMaster/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                      MapOutputTrackerMaster takes the following to be created:

                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                      • BroadcastManager
• isLocal flag (to indicate whether MapOutputTrackerMaster runs in local mode or on a cluster)

                                                                                                                                                                                                                        When created, MapOutputTrackerMaster starts dispatcher threads on the map-output-dispatcher thread pool.

MapOutputTrackerMaster is created when:

                                                                                                                                                                                                                        • SparkEnv utility is used to create a SparkEnv for the driver
                                                                                                                                                                                                                        "},{"location":"scheduler/MapOutputTrackerMaster/#maxrpcmessagesize","title":"maxRpcMessageSize

                                                                                                                                                                                                                        maxRpcMessageSize is...FIXME

                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#broadcastmanager","title":"BroadcastManager

MapOutputTrackerMaster is given a BroadcastManager when created.

                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#shuffle-map-output-status-registry","title":"Shuffle Map Output Status Registry

                                                                                                                                                                                                                        MapOutputTrackerMaster uses an internal registry of ShuffleStatuses by shuffle stages.

                                                                                                                                                                                                                        MapOutputTrackerMaster adds a new shuffle when requested to register one (when DAGScheduler is requested to create a ShuffleMapStage for a ShuffleDependency).

                                                                                                                                                                                                                        MapOutputTrackerMaster uses the registry when requested for the following:

                                                                                                                                                                                                                        • registerMapOutput

                                                                                                                                                                                                                        • getStatistics

                                                                                                                                                                                                                        • MessageLoop

                                                                                                                                                                                                                        • unregisterMapOutput, unregisterAllMapOutput, unregisterShuffle, removeOutputsOnHost, removeOutputsOnExecutor, containsShuffle, getNumAvailableOutputs, findMissingPartitions, getLocationsWithLargestOutputs, getMapSizesByExecutorId

                                                                                                                                                                                                                        MapOutputTrackerMaster removes (clears) all shuffles when requested to stop.

                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#configuration-properties","title":"Configuration Properties

                                                                                                                                                                                                                        MapOutputTrackerMaster uses the following configuration properties:

                                                                                                                                                                                                                        • spark.shuffle.mapOutput.minSizeForBroadcast

                                                                                                                                                                                                                        • spark.shuffle.mapOutput.dispatcher.numThreads

                                                                                                                                                                                                                        • spark.shuffle.reduceLocality.enabled","text":""},{"location":"scheduler/MapOutputTrackerMaster/#map-and-reduce-task-thresholds-for-preferred-locations","title":"Map and Reduce Task Thresholds for Preferred Locations

MapOutputTrackerMaster defines 1000 (tasks) as the hardcoded threshold on the number of map and reduce tasks when requested to compute preferred locations (with spark.shuffle.reduceLocality.enabled); preferred locations are only computed when both numbers are below the threshold.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#map-output-threshold-for-preferred-location-of-reduce-tasks","title":"Map Output Threshold for Preferred Location of Reduce Tasks

MapOutputTrackerMaster defines 0.2 as the fraction of total map output that must be at a location for it to be considered a preferred location for a reduce task.

                                                                                                                                                                                                                          Making this larger will focus on fewer locations where most data can be read locally, but may lead to more delay in scheduling if those locations are busy.

                                                                                                                                                                                                                          MapOutputTrackerMaster uses the fraction when requested for the preferred locations of shuffle RDDs.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#getmapoutputmessage-queue","title":"GetMapOutputMessage Queue

                                                                                                                                                                                                                          MapOutputTrackerMaster uses a blocking queue (a Java LinkedBlockingQueue) for requests for map output statuses.

                                                                                                                                                                                                                          GetMapOutputMessage(\n  shuffleId: Int,\n  context: RpcCallContext)\n

                                                                                                                                                                                                                          GetMapOutputMessage holds the shuffle ID and the RpcCallContext of the caller.

                                                                                                                                                                                                                          A new GetMapOutputMessage is added to the queue when MapOutputTrackerMaster is requested to post one.

                                                                                                                                                                                                                          MapOutputTrackerMaster uses MessageLoop Dispatcher Threads to process GetMapOutputMessages.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#messageloop-dispatcher-thread","title":"MessageLoop Dispatcher Thread

                                                                                                                                                                                                                          MessageLoop is a thread of execution to handle GetMapOutputMessages until a PoisonPill marker message arrives (when MapOutputTrackerMaster is requested to stop).

                                                                                                                                                                                                                          MessageLoop takes a GetMapOutputMessage and prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                          Handling request to send map output locations for shuffle [shuffleId] to [hostPort]\n

                                                                                                                                                                                                                          MessageLoop then finds the ShuffleStatus by the shuffle ID in the shuffleStatuses internal registry and replies back (to the RPC client) with a serialized map output status (with the BroadcastManager and spark.shuffle.mapOutput.minSizeForBroadcast configuration property).

                                                                                                                                                                                                                          MessageLoop threads run on the map-output-dispatcher Thread Pool.
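The queue-plus-PoisonPill pattern above can be illustrated with a minimal, self-contained sketch (not Spark's actual code; Message, Request and the demo main are illustrative stand-ins):

```scala
import java.util.concurrent.LinkedBlockingQueue

// A simplified model of a message loop draining a blocking queue until a poison pill arrives.
object MessageLoopSketch {
  sealed trait Message
  case class Request(shuffleId: Int) extends Message
  case object PoisonPill extends Message

  val queue = new LinkedBlockingQueue[Message]()

  val loop: Runnable = () => {
    var running = true
    while (running) {
      queue.take() match {
        case PoisonPill =>
          // Re-post the marker so sibling loops (if any) also terminate, then exit.
          queue.offer(PoisonPill)
          running = false
        case Request(shuffleId) =>
          println(s"Handling request to send map output locations for shuffle $shuffleId")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val t = new Thread(loop, "message-loop-demo")
    t.start()
    queue.offer(Request(shuffleId = 0))
    queue.offer(PoisonPill) // roughly what stopping the tracker does
    t.join()
  }
}
```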

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#map-output-dispatcher-thread-pool","title":"map-output-dispatcher Thread Pool
                                                                                                                                                                                                                          threadpool: ThreadPoolExecutor\n

                                                                                                                                                                                                                          threadpool is a daemon fixed thread pool registered with map-output-dispatcher thread name prefix.

                                                                                                                                                                                                                          threadpool uses spark.shuffle.mapOutput.dispatcher.numThreads configuration property for the number of MessageLoop dispatcher threads to process received GetMapOutputMessage messages.

                                                                                                                                                                                                                          The dispatcher threads are started immediately when MapOutputTrackerMaster is created.

                                                                                                                                                                                                                          The thread pool is shut down when MapOutputTrackerMaster is requested to stop.
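For illustration, a daemon fixed thread pool with a thread name prefix could be built along the following lines (a sketch using plain java.util.concurrent, not Spark's ThreadUtils; the names and numbers are assumptions):

```scala
import java.util.concurrent.{Executors, ThreadFactory, ThreadPoolExecutor}
import java.util.concurrent.atomic.AtomicInteger

object DispatcherPoolSketch {
  // Builds a fixed-size pool whose threads are daemons named "<prefix>-0", "<prefix>-1", ...
  def newDaemonFixedThreadPool(numThreads: Int, prefix: String): ThreadPoolExecutor = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
        t.setDaemon(true)
        t
      }
    }
    Executors.newFixedThreadPool(numThreads, factory).asInstanceOf[ThreadPoolExecutor]
  }

  def main(args: Array[String]): Unit = {
    val pool = newDaemonFixedThreadPool(numThreads = 8, prefix = "map-output-dispatcher")
    (0 until 8).foreach(_ => pool.execute(() => println(Thread.currentThread().getName)))
    pool.shutdown() // what stopping the tracker does to the pool
  }
}
```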

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#epoch-number","title":"Epoch Number

                                                                                                                                                                                                                          MapOutputTrackerMaster uses an epoch number to...FIXME

                                                                                                                                                                                                                          getEpoch is used when:

                                                                                                                                                                                                                          • DAGScheduler is requested to removeExecutorAndUnregisterOutputs

                                                                                                                                                                                                                          • TaskSetManager is created (and sets the epoch to tasks)

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#enqueueing-getmapoutputmessage","title":"Enqueueing GetMapOutputMessage
                                                                                                                                                                                                                          post(\n  message: GetMapOutputMessage): Unit\n

                                                                                                                                                                                                                          post simply adds the input GetMapOutputMessage to the mapOutputRequests internal queue.

                                                                                                                                                                                                                          post is used when MapOutputTrackerMasterEndpoint is requested to handle a GetMapOutputStatuses message.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#stopping-mapoutputtrackermaster","title":"Stopping MapOutputTrackerMaster
                                                                                                                                                                                                                          stop(): Unit\n

                                                                                                                                                                                                                          stop...FIXME

                                                                                                                                                                                                                          stop is part of the MapOutputTracker abstraction.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#unregistering-shuffle-map-output","title":"Unregistering Shuffle Map Output
                                                                                                                                                                                                                          unregisterMapOutput(\n  shuffleId: Int,\n  mapId: Int,\n  bmAddress: BlockManagerId): Unit\n

                                                                                                                                                                                                                          unregisterMapOutput...FIXME

                                                                                                                                                                                                                          unregisterMapOutput is used when DAGScheduler is requested to handle a task completion (due to a fetch failure).

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#computing-preferred-locations","title":"Computing Preferred Locations
                                                                                                                                                                                                                          getPreferredLocationsForShuffle(\n  dep: ShuffleDependency[_, _, _],\n  partitionId: Int): Seq[String]\n

                                                                                                                                                                                                                          getPreferredLocationsForShuffle computes the locations (BlockManagers) with the most shuffle map outputs for the input ShuffleDependency and Partition.

                                                                                                                                                                                                                          getPreferredLocationsForShuffle computes the locations when all of the following are met:

                                                                                                                                                                                                                          • spark.shuffle.reduceLocality.enabled configuration property is enabled

                                                                                                                                                                                                                          • The number of \"map\" partitions (of the RDD of the input ShuffleDependency) is below SHUFFLE_PREF_MAP_THRESHOLD

                                                                                                                                                                                                                          • The number of \"reduce\" partitions (of the Partitioner of the input ShuffleDependency) is below SHUFFLE_PREF_REDUCE_THRESHOLD

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          getPreferredLocationsForShuffle is simply getLocationsWithLargestOutputs with a guard condition.

Internally, getPreferredLocationsForShuffle checks whether the spark.shuffle.reduceLocality.enabled configuration property is enabled and whether the number of partitions of the RDD of the input ShuffleDependency and the number of partitions of its Partitioner are both below 1000.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          The thresholds for the number of partitions in the RDD and of the partitioner when computing the preferred locations are 1000 and are not configurable.

                                                                                                                                                                                                                          If the condition holds, getPreferredLocationsForShuffle finds locations with the largest number of shuffle map outputs for the input ShuffleDependency and partitionId (with the number of partitions in the partitioner of the input ShuffleDependency and 0.2) and returns the hosts of the preferred BlockManagers.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          0.2 is the fraction of total map output that must be at a location to be considered as a preferred location for a reduce task. It is not configurable.

                                                                                                                                                                                                                          getPreferredLocationsForShuffle is used when ShuffledRDD and Spark SQL's ShuffledRowRDD are requested for preferred locations of a partition.
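As a rough sketch, the guard condition could be expressed as follows (the parameter names are illustrative; the 1000 thresholds mirror the hardcoded SHUFFLE_PREF_MAP_THRESHOLD and SHUFFLE_PREF_REDUCE_THRESHOLD values mentioned above):

```scala
// Returns true only when reduce locality is enabled and both the map and reduce
// sides stay below the hardcoded thresholds.
def shouldComputePreferredLocations(
    reduceLocalityEnabled: Boolean,
    numMapPartitions: Int,
    numReducePartitions: Int): Boolean = {
  val SHUFFLE_PREF_MAP_THRESHOLD = 1000
  val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000
  reduceLocalityEnabled &&
    numMapPartitions < SHUFFLE_PREF_MAP_THRESHOLD &&
    numReducePartitions < SHUFFLE_PREF_REDUCE_THRESHOLD
}
```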

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-locations-with-largest-number-of-shuffle-map-outputs","title":"Finding Locations with Largest Number of Shuffle Map Outputs
                                                                                                                                                                                                                          getLocationsWithLargestOutputs(\n  shuffleId: Int,\n  reducerId: Int,\n  numReducers: Int,\n  fractionThreshold: Double): Option[Array[BlockManagerId]]\n

getLocationsWithLargestOutputs returns the BlockManagerIds whose shuffle blocks (for the given shuffle) have a cumulative size above the input fractionThreshold of the total size of all the shuffle blocks across all BlockManagers.

                                                                                                                                                                                                                          Note

getLocationsWithLargestOutputs may return no BlockManagerIds when none of them manages shuffle blocks whose total size exceeds the input fractionThreshold.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          The input numReducers is not used.

                                                                                                                                                                                                                          Internally, getLocationsWithLargestOutputs queries the mapStatuses internal cache for the input shuffleId.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          One entry in mapStatuses internal cache is a MapStatus array indexed by partition id.

                                                                                                                                                                                                                          MapStatus includes information about the BlockManager (as BlockManagerId) and estimated size of the reduce blocks.

getLocationsWithLargestOutputs iterates over the MapStatus array and builds an interim mapping between each BlockManagerId and the cumulative size of the shuffle blocks it manages.
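A simplified model of that aggregation, with plain Strings standing in for BlockManagerIds and the per-location sizes assumed to be pre-computed, might look like this (a sketch, not Spark's implementation):

```scala
// Keeps only the locations whose cumulative shuffle-block size is at least
// fractionThreshold of the total output size.
def locationsWithLargestOutputs(
    sizesByLocation: Map[String, Long],
    fractionThreshold: Double): Option[Array[String]] = {
  val totalOutputSize = sizesByLocation.values.sum.toDouble
  val topLocs = sizesByLocation.collect {
    case (loc, size) if size.toDouble / totalOutputSize >= fractionThreshold => loc
  }
  if (topLocs.nonEmpty) Some(topLocs.toArray) else None
}

// With a 0.2 threshold, only host-a qualifies:
// locationsWithLargestOutputs(Map("host-a" -> 80L, "host-b" -> 10L, "host-c" -> 10L), 0.2)
```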

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#incrementing-epoch","title":"Incrementing Epoch
                                                                                                                                                                                                                          incrementEpoch(): Unit\n

                                                                                                                                                                                                                          incrementEpoch increments the internal epoch.

                                                                                                                                                                                                                          incrementEpoch prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                          Increasing epoch to [epoch]\n

                                                                                                                                                                                                                          incrementEpoch is used when:

                                                                                                                                                                                                                          • MapOutputTrackerMaster is requested to unregisterMapOutput, unregisterAllMapOutput, removeOutputsOnHost and removeOutputsOnExecutor

                                                                                                                                                                                                                          • DAGScheduler is requested to handle a ShuffleMapTask completion (of a ShuffleMapStage)
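A bare-bones illustration of the epoch bookkeeping, a counter bumped under a lock so that stale map output statuses can be detected (EpochSketch and its fields are assumptions, not Spark's actual members):

```scala
class EpochSketch {
  private val epochLock = new AnyRef
  private var epoch: Long = 0L

  // Bumps the epoch and logs the new value, mirroring the DEBUG message above.
  def incrementEpoch(): Unit = epochLock.synchronized {
    epoch += 1
    println(s"Increasing epoch to $epoch")
  }

  def getEpoch: Long = epochLock.synchronized(epoch)
}
```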

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#checking-availability-of-shuffle-map-output-status","title":"Checking Availability of Shuffle Map Output Status
                                                                                                                                                                                                                          containsShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                          containsShuffle checks if the input shuffleId is registered in the cachedSerializedStatuses or mapStatuses internal caches.

containsShuffle is used when DAGScheduler is requested to create a ShuffleMapStage (for a ShuffleDependency).

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-shuffle","title":"Registering Shuffle
                                                                                                                                                                                                                          registerShuffle(\n  shuffleId: Int,\n  numMaps: Int): Unit\n

                                                                                                                                                                                                                          registerShuffle registers a new ShuffleStatus (for the given shuffle ID and the number of partitions) to the shuffleStatuses internal registry.

                                                                                                                                                                                                                          registerShuffle throws an IllegalArgumentException when the shuffle ID has already been registered:

                                                                                                                                                                                                                          Shuffle ID [shuffleId] registered twice\n

                                                                                                                                                                                                                          registerShuffle is used when:

                                                                                                                                                                                                                          • DAGScheduler is requested to create a ShuffleMapStage (for a ShuffleDependency)
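The "register once" behaviour can be sketched with a concurrent map keyed by shuffle ID (ShuffleStatusSketch and ShuffleRegistry are illustrative placeholders for Spark's ShuffleStatus and the shuffleStatuses registry):

```scala
import java.util.concurrent.ConcurrentHashMap

final case class ShuffleStatusSketch(numPartitions: Int)

class ShuffleRegistry {
  private val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatusSketch]()

  // Registers the shuffle once; a second registration with the same ID fails.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
    val previous = shuffleStatuses.putIfAbsent(shuffleId, ShuffleStatusSketch(numMaps))
    if (previous != null) {
      throw new IllegalArgumentException(s"Shuffle ID $shuffleId registered twice")
    }
  }
}
```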
                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-map-outputs-for-shuffle-possibly-with-epoch-change","title":"Registering Map Outputs for Shuffle (Possibly with Epoch Change)
                                                                                                                                                                                                                          registerMapOutputs(\n  shuffleId: Int,\n  statuses: Array[MapStatus],\n  changeEpoch: Boolean = false): Unit\n

                                                                                                                                                                                                                          registerMapOutputs registers the input statuses (as the shuffle map output) with the input shuffleId in the mapStatuses internal cache.

                                                                                                                                                                                                                          registerMapOutputs increments epoch if the input changeEpoch is enabled (it is not by default).

                                                                                                                                                                                                                          registerMapOutputs is used when DAGScheduler handles successful ShuffleMapTask completion and executor lost events.
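Continuing the registry idea above, registering map outputs with an optional epoch bump could look like the following (MapStatusSketch and MapOutputRegistry are hypothetical stand-ins; changeEpoch defaults to false as described):

```scala
import scala.collection.concurrent.TrieMap

final case class MapStatusSketch(location: String, sizes: Array[Long])

class MapOutputRegistry {
  private val mapStatuses = TrieMap.empty[Int, Array[MapStatusSketch]]
  private var epoch: Long = 0L

  // Stores the statuses for the shuffle and bumps the epoch only when requested.
  def registerMapOutputs(
      shuffleId: Int,
      statuses: Array[MapStatusSketch],
      changeEpoch: Boolean = false): Unit = synchronized {
    mapStatuses(shuffleId) = statuses
    if (changeEpoch) epoch += 1
  }
}
```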

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-serialized-map-output-statuses-and-possibly-broadcasting-them","title":"Finding Serialized Map Output Statuses (And Possibly Broadcasting Them)
                                                                                                                                                                                                                          getSerializedMapOutputStatuses(\n  shuffleId: Int): Array[Byte]\n

                                                                                                                                                                                                                          getSerializedMapOutputStatuses finds cached serialized map statuses for the input shuffleId.

                                                                                                                                                                                                                          If found, getSerializedMapOutputStatuses returns the cached serialized map statuses.

Otherwise, getSerializedMapOutputStatuses acquires the shuffle lock for shuffleId and checks the cache again, since another thread could have updated the cachedSerializedStatuses internal cache in the meantime.

                                                                                                                                                                                                                          getSerializedMapOutputStatuses returns the serialized map statuses if found.

                                                                                                                                                                                                                          If not, getSerializedMapOutputStatuses serializes the local array of MapStatuses (from checkCachedStatuses).

                                                                                                                                                                                                                          getSerializedMapOutputStatuses prints out the following INFO message to the logs:

                                                                                                                                                                                                                          Size of output statuses for shuffle [shuffleId] is [bytes] bytes\n

                                                                                                                                                                                                                          getSerializedMapOutputStatuses saves the serialized map output statuses in cachedSerializedStatuses internal cache if the epoch has not changed in the meantime. getSerializedMapOutputStatuses also saves its broadcast version in cachedSerializedBroadcast internal cache.

                                                                                                                                                                                                                          If the epoch has changed in the meantime, the serialized map output statuses and their broadcast version are not saved, and getSerializedMapOutputStatuses prints out the following INFO message to the logs:

                                                                                                                                                                                                                          Epoch changed, not caching!\n

                                                                                                                                                                                                                          getSerializedMapOutputStatuses removes the broadcast.

                                                                                                                                                                                                                          getSerializedMapOutputStatuses returns the serialized map statuses.

                                                                                                                                                                                                                          getSerializedMapOutputStatuses is used when MapOutputTrackerMaster responds to GetMapOutputMessage requests and DAGScheduler creates ShuffleMapStage for ShuffleDependency (copying the shuffle map output locations from previous jobs to avoid unnecessarily regenerating data).
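The cache-under-epoch-check flow can be condensed into a sketch like the following (serialize() is a placeholder for the real serialization with the BroadcastManager; the class and its fields are assumptions, not Spark's code):

```scala
import scala.collection.mutable

class SerializedStatusCache {
  private val epochLock = new AnyRef
  private var epoch: Long = 0L
  private val cached = mutable.Map.empty[Int, Array[Byte]]

  private def serialize(shuffleId: Int): Array[Byte] =
    s"statuses-of-$shuffleId".getBytes("UTF-8") // placeholder for real serialization

  def getSerializedMapOutputStatuses(shuffleId: Int): Array[Byte] = {
    // Read the cache and remember the epoch seen at that point.
    val (hit, epochGotten) = epochLock.synchronized {
      (cached.get(shuffleId), epoch)
    }
    hit match {
      case Some(bytes) => bytes
      case None =>
        val bytes = serialize(shuffleId) // potentially expensive, done outside the lock
        epochLock.synchronized {
          if (epoch == epochGotten) cached(shuffleId) = bytes // only cache if still current
          else println("Epoch changed, not caching!")
        }
        bytes
    }
  }
}
```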

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-cached-serialized-map-statuses","title":"Finding Cached Serialized Map Statuses
                                                                                                                                                                                                                          checkCachedStatuses(): Boolean\n

checkCachedStatuses is an internal helper method that getSerializedMapOutputStatuses uses to do some bookkeeping (when the epoch and cacheEpoch differ) and to set the local statuses, retBytes and epochGotten variables (that getSerializedMapOutputStatuses uses).

Internally, checkCachedStatuses acquires the epochLock lock and compares the current epoch with cacheEpoch.

If epoch is younger (i.e. greater), checkCachedStatuses clears the cachedSerializedStatuses internal cache, clears the cached broadcast variables, and sets cacheEpoch to epoch.

checkCachedStatuses then looks up the serialized map output statuses for the shuffleId in the cachedSerializedStatuses internal cache.

                                                                                                                                                                                                                          When the serialized map output status is found, checkCachedStatuses saves it in a local retBytes and returns true.

                                                                                                                                                                                                                          When not found, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                          cached status not found for : [shuffleId]\n

checkCachedStatuses uses the mapStatuses internal cache to get the map output statuses for the shuffleId (or falls back to an empty array) and assigns them to the local statuses variable. checkCachedStatuses sets the local epochGotten to the current epoch and returns false.","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-shuffle-map-output","title":"Registering Shuffle Map Output

                                                                                                                                                                                                                          registerMapOutput(\n  shuffleId: Int,\n  mapId: Int,\n  status: MapStatus): Unit\n

                                                                                                                                                                                                                          registerMapOutput finds the ShuffleStatus by the given shuffle ID and adds the given MapStatus:

                                                                                                                                                                                                                          • The given mapId is the partitionId of the ShuffleMapTask that finished.

                                                                                                                                                                                                                          • The given shuffleId is the shuffleId of the ShuffleDependency of the ShuffleMapStage (for which the ShuffleMapTask completed)

                                                                                                                                                                                                                          registerMapOutput is used when DAGScheduler is requested to handle a ShuffleMapTask completion.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#map-output-statistics-for-shuffledependency","title":"Map Output Statistics for ShuffleDependency
                                                                                                                                                                                                                          getStatistics(\n  dep: ShuffleDependency[_, _, _]): MapOutputStatistics\n

                                                                                                                                                                                                                          getStatistics requests the input ShuffleDependency for the shuffle ID and looks up the corresponding ShuffleStatus (in the shuffleStatuses registry).

                                                                                                                                                                                                                          getStatistics assumes that the ShuffleStatus is in shuffleStatuses registry.

                                                                                                                                                                                                                          getStatistics requests the ShuffleStatus for the MapStatuses (of the ShuffleDependency).

                                                                                                                                                                                                                          getStatistics uses the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property to decide on parallelism to calculate the statistics.

                                                                                                                                                                                                                          With no parallelism, getStatistics simply traverses over the MapStatuses and requests them (one by one) for the size of every shuffle block.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          getStatistics requests the given ShuffleDependency for the Partitioner that in turn is requested for the number of partitions.

                                                                                                                                                                                                                          The number of blocks is the number of MapStatuses multiplied by the number of partitions.

                                                                                                                                                                                                                          And hence the need for parallelism based on the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property.

In the end, getStatistics creates a MapOutputStatistics with the shuffle ID (of the given ShuffleDependency) and the total sizes (summed up for every partition).

                                                                                                                                                                                                                          getStatistics is used when:

                                                                                                                                                                                                                          • DAGScheduler is requested to handle a successful ShuffleMapStage submission and markMapStageJobsAsFinished
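With no parallelism, the summation boils down to something like this sketch (mapOutputSizes is an illustrative input, one array of per-reduce-partition sizes per completed map task; not Spark's actual code):

```scala
// Adds up, per reduce partition, the sizes reported by every map output.
def computeOutputStatistics(
    numReducePartitions: Int,
    mapOutputSizes: Seq[Array[Long]]): Array[Long] = {
  val totalSizes = new Array[Long](numReducePartitions)
  for (sizes <- mapOutputSizes; reduceId <- 0 until numReducePartitions) {
    totalSizes(reduceId) += sizes(reduceId)
  }
  totalSizes
}

// Two map outputs, three reduce partitions:
// computeOutputStatistics(3, Seq(Array(1L, 2L, 3L), Array(4L, 5L, 6L))) == Array(5L, 7L, 9L)
```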
                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-all-map-outputs-of-shuffle-stage","title":"Deregistering All Map Outputs of Shuffle Stage
                                                                                                                                                                                                                          unregisterAllMapOutput(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                          unregisterAllMapOutput...FIXME

                                                                                                                                                                                                                          unregisterAllMapOutput is used when DAGScheduler is requested to handle a task completion (due to a fetch failure).

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle","title":"Deregistering Shuffle
                                                                                                                                                                                                                          unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                          unregisterShuffle...FIXME

                                                                                                                                                                                                                          unregisterShuffle is part of the MapOutputTracker abstraction.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle-outputs-associated-with-host","title":"Deregistering Shuffle Outputs Associated with Host
                                                                                                                                                                                                                          removeOutputsOnHost(\n  host: String): Unit\n

                                                                                                                                                                                                                          removeOutputsOnHost...FIXME

                                                                                                                                                                                                                          removeOutputsOnHost is used when DAGScheduler is requested to removeExecutorAndUnregisterOutputs and handle a worker removal.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle-outputs-associated-with-executor","title":"Deregistering Shuffle Outputs Associated with Executor
                                                                                                                                                                                                                          removeOutputsOnExecutor(\n  execId: String): Unit\n

                                                                                                                                                                                                                          removeOutputsOnExecutor...FIXME

                                                                                                                                                                                                                          removeOutputsOnExecutor is used when DAGScheduler is requested to removeExecutorAndUnregisterOutputs.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#number-of-partitions-with-shuffle-map-outputs-available","title":"Number of Partitions with Shuffle Map Outputs Available
                                                                                                                                                                                                                          getNumAvailableOutputs(\n  shuffleId: Int): Int\n

                                                                                                                                                                                                                          getNumAvailableOutputs...FIXME

                                                                                                                                                                                                                          getNumAvailableOutputs is used when ShuffleMapStage is requested for the number of partitions with shuffle outputs available.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-missing-partitions","title":"Finding Missing Partitions
                                                                                                                                                                                                                          findMissingPartitions(\n  shuffleId: Int): Option[Seq[Int]]\n

                                                                                                                                                                                                                          findMissingPartitions...FIXME

                                                                                                                                                                                                                          findMissingPartitions is used when ShuffleMapStage is requested for missing partitions.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-locations-with-blocks-and-sizes","title":"Finding Locations with Blocks and Sizes
                                                                                                                                                                                                                          getMapSizesByExecutorId(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long)])]\n

                                                                                                                                                                                                                          getMapSizesByExecutorId is part of the MapOutputTracker abstraction.

                                                                                                                                                                                                                          getMapSizesByExecutorId returns a collection of BlockManagerIds with their blocks and sizes.

                                                                                                                                                                                                                          When executed, getMapSizesByExecutorId prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                          Fetching outputs for shuffle [id], partitions [startPartition]-[endPartition]\n

                                                                                                                                                                                                                          getMapSizesByExecutorId finds map outputs for the input shuffleId.

                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                          getMapSizesByExecutorId gets the map outputs for all the partitions (despite the method's signature).

                                                                                                                                                                                                                          In the end, getMapSizesByExecutorId converts shuffle map outputs (as MapStatuses) into the collection of BlockManagerIds with their blocks and sizes.
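That final conversion step can be approximated as follows (a toy sketch with Strings standing in for BlockManagerId and BlockId; MapOutput and the shuffle_<shuffleId>_<mapId>_<reduceId> block naming are assumed here for illustration):

```scala
final case class MapOutput(location: String, sizesByReduceId: Array[Long])

// Groups the (block, size) pairs of the requested reduce partitions by the location
// of the map output that produced them.
def mapSizesByLocation(
    shuffleId: Int,
    mapOutputs: Seq[MapOutput],
    startPartition: Int,
    endPartition: Int): Iterator[(String, Seq[(String, Long)])] = {
  mapOutputs.zipWithIndex
    .flatMap { case (output, mapId) =>
      (startPartition until endPartition).map { reduceId =>
        (output.location,
          (s"shuffle_${shuffleId}_${mapId}_$reduceId", output.sizesByReduceId(reduceId)))
      }
    }
    .groupBy(_._1)
    .map { case (location, entries) => (location, entries.map(_._2)) }
    .iterator
}
```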

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#logging","title":"Logging

                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.MapOutputTrackerMaster logger to see what happens inside.

                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                          log4j.logger.org.apache.spark.MapOutputTrackerMaster=ALL\n

                                                                                                                                                                                                                          Refer to Logging.

                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/","title":"MapOutputTrackerMasterEndpoint","text":"

                                                                                                                                                                                                                          MapOutputTrackerMasterEndpoint is an RpcEndpoint for MapOutputTrackerMaster.

                                                                                                                                                                                                                          MapOutputTrackerMasterEndpoint is registered under the name of MapOutputTracker (on the driver).

                                                                                                                                                                                                                          "},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                          MapOutputTrackerMasterEndpoint takes the following to be created:

                                                                                                                                                                                                                          • RpcEnv
                                                                                                                                                                                                                          • MapOutputTrackerMaster
                                                                                                                                                                                                                          • SparkConf

MapOutputTrackerMasterEndpoint is created when:

                                                                                                                                                                                                                            • SparkEnv is created (for the driver and executors)

                                                                                                                                                                                                                            While being created, MapOutputTrackerMasterEndpoint prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                            init\n
                                                                                                                                                                                                                            "},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#messages","title":"Messages","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#getmapoutputstatuses","title":"GetMapOutputStatuses
                                                                                                                                                                                                                            GetMapOutputStatuses(\n  shuffleId: Int)\n

Posted when MapOutputTrackerWorker is requested for shuffle map outputs for a given shuffle ID.

                                                                                                                                                                                                                            When received, MapOutputTrackerMasterEndpoint prints out the following INFO message to the logs:

                                                                                                                                                                                                                            Asked to send map output locations for shuffle [shuffleId] to [hostPort]\n

                                                                                                                                                                                                                            In the end, MapOutputTrackerMasterEndpoint requests the MapOutputTrackerMaster to post a GetMapOutputMessage (with the input shuffleId). Whatever is returned from MapOutputTrackerMaster becomes the response.
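The delegation can be pictured with a short, hedged sketch (not the actual Spark source). It assumes the standard RpcEndpoint receiveAndReply callback, a tracker field holding the MapOutputTrackerMaster, and a GetMapOutputMessage that carries the shuffleId together with the RpcCallContext used for the reply.

```scala
// Illustrative only: how MapOutputTrackerMasterEndpoint could serve the message.
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case GetMapOutputStatuses(shuffleId: Int) =>
    val hostPort = context.senderAddress.hostPort
    logInfo(s"Asked to send map output locations for shuffle $shuffleId to $hostPort")
    // Hand the request over to MapOutputTrackerMaster, which prepares the
    // serialized map output statuses and replies through the RpcCallContext.
    tracker.post(GetMapOutputMessage(shuffleId, context))
}
```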

                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#stopmapoutputtracker","title":"StopMapOutputTracker

                                                                                                                                                                                                                            Posted when MapOutputTrackerMaster is requested to stop.

                                                                                                                                                                                                                            When received, MapOutputTrackerMasterEndpoint prints out the following INFO message to the logs:

                                                                                                                                                                                                                            MapOutputTrackerMasterEndpoint stopped!\n

                                                                                                                                                                                                                            MapOutputTrackerMasterEndpoint confirms the request (by replying true) and stops.

                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#logging","title":"Logging

                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.MapOutputTrackerMasterEndpoint logger to see what happens inside.

                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                            log4j.logger.org.apache.spark.MapOutputTrackerMasterEndpoint=ALL\n

                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapOutputTrackerWorker/","title":"MapOutputTrackerWorker","text":"

                                                                                                                                                                                                                            MapOutputTrackerWorker is the MapOutputTracker for executors.

                                                                                                                                                                                                                            MapOutputTrackerWorker uses Java's thread-safe java.util.concurrent.ConcurrentHashMap for mapStatuses internal cache and any lookup cache miss triggers a fetch from the driver's MapOutputTrackerMaster.

Finding Shuffle Map Outputs

                                                                                                                                                                                                                            "},{"location":"scheduler/MapOutputTrackerWorker/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                            getStatuses( shuffleId: Int): Array[MapStatus]

getStatuses finds MapStatuses for the input shuffleId in the mapStatuses internal cache and, when not available, fetches them from a remote MapOutputTrackerMaster (using RPC).

Internally, getStatuses first queries the mapStatuses internal cache and returns the map outputs if found.

                                                                                                                                                                                                                            If not found (in the mapStatuses internal cache), you should see the following INFO message in the logs:

                                                                                                                                                                                                                            Don't have map outputs for shuffle [id], fetching them\n

If another thread is already fetching the map outputs for the shuffleId (as recorded in the fetching internal registry), getStatuses waits until it is done.

When no other thread is fetching the map outputs, getStatuses registers the input shuffleId in the fetching internal registry (of shuffle map outputs being fetched).

                                                                                                                                                                                                                            You should see the following INFO message in the logs:

                                                                                                                                                                                                                            Doing the fetch; tracker endpoint = [trackerEndpoint]\n

getStatuses sends a GetMapOutputStatuses RPC remote message for the input shuffleId to the trackerEndpoint expecting an Array[Byte].

NOTE: getStatuses requests shuffle map outputs remotely within a timeout and with retries. Refer to RpcEndpointRef.

getStatuses deserializes the fetched bytes into map output statuses and records the result in the mapStatuses internal cache.

                                                                                                                                                                                                                            You should see the following INFO message in the logs:

                                                                                                                                                                                                                            Got the output locations\n

                                                                                                                                                                                                                            getStatuses removes the input shuffleId from fetching internal registry.

                                                                                                                                                                                                                            You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                            Fetching map output statuses for shuffle [id] took [time] ms\n

If getStatuses could not find the map output locations for the input shuffleId (locally and remotely), you should see the following ERROR message in the logs and getStatuses throws a MetadataFetchFailedException.

                                                                                                                                                                                                                            Missing all output locations for shuffle [id]\n

NOTE: getStatuses is used when MapOutputTracker is requested for shuffle map output sizes (getMapSizesByExecutorId) and for shuffle statistics (getStatistics) of a ShuffleDependency.
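The whole cache-or-fetch flow described above can be summarized in a hedged sketch (simplified, not the actual Spark source). It assumes a java.util.concurrent.ConcurrentHashMap-backed mapStatuses cache, a fetching set of in-flight shuffle ids, and an askTracker helper that sends an RPC message to the trackerEndpoint.

```scala
import org.apache.spark.shuffle.MetadataFetchFailedException

def getStatuses(shuffleId: Int): Array[MapStatus] = {
  // 1. Serve from the local cache if possible.
  val cached = mapStatuses.get(shuffleId)
  if (cached != null) return cached

  logInfo(s"Don't have map outputs for shuffle $shuffleId, fetching them")
  var statuses: Array[MapStatus] = null
  fetching.synchronized {
    // 2. If another thread is already fetching this shuffle, wait for it.
    while (fetching.contains(shuffleId)) {
      fetching.wait()
    }
    statuses = mapStatuses.get(shuffleId)
    if (statuses == null) {
      fetching += shuffleId  // this thread will do the fetch
    }
  }
  if (statuses == null) {
    try {
      // 3. Ask the driver-side MapOutputTrackerMasterEndpoint for the statuses.
      logInfo(s"Doing the fetch; tracker endpoint = $trackerEndpoint")
      val bytes = askTracker[Array[Byte]](GetMapOutputStatuses(shuffleId))
      statuses = MapOutputTracker.deserializeMapStatuses(bytes)
      mapStatuses.put(shuffleId, statuses)
    } finally {
      // 4. Always unregister the shuffle and wake up any waiting threads.
      fetching.synchronized {
        fetching -= shuffleId
        fetching.notifyAll()
      }
    }
  }
  if (statuses == null) {
    logError(s"Missing all output locations for shuffle $shuffleId")
    throw new MetadataFetchFailedException(
      shuffleId, -1, s"Missing all output locations for shuffle $shuffleId")
  }
  statuses
}
```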

Logging

                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.MapOutputTrackerWorker logger to see what happens inside.

                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                            "},{"location":"scheduler/MapOutputTrackerWorker/#source","title":"[source]","text":""},{"location":"scheduler/MapOutputTrackerWorker/#log4jloggerorgapachesparkmapoutputtrackerworkerall","title":"log4j.logger.org.apache.spark.MapOutputTrackerWorker=ALL","text":"

Refer to Logging.

                                                                                                                                                                                                                            "},{"location":"scheduler/MapStatus/","title":"MapStatus","text":"

                                                                                                                                                                                                                            MapStatus is an abstraction of shuffle map output statuses with an estimated size, location and map Id.

                                                                                                                                                                                                                            MapStatus is a result of executing a ShuffleMapTask.

                                                                                                                                                                                                                            After a ShuffleMapTask has finished execution successfully, DAGScheduler is requested to handle a ShuffleMapTask completion that in turn requests the MapOutputTrackerMaster to register the MapStatus.

                                                                                                                                                                                                                            "},{"location":"scheduler/MapStatus/#contract","title":"Contract","text":""},{"location":"scheduler/MapStatus/#estimated-size","title":"Estimated Size
                                                                                                                                                                                                                            getSizeForBlock(\n  reduceId: Int): Long\n

                                                                                                                                                                                                                            Estimated size (in bytes)

                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                            • MapOutputTrackerMaster is requested for a MapOutputStatistics and locations with the largest number of shuffle map outputs
                                                                                                                                                                                                                            • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                            • OptimizeSkewedJoin (Spark SQL) physical optimization is executed
                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapStatus/#location","title":"Location
                                                                                                                                                                                                                            location: BlockManagerId\n

                                                                                                                                                                                                                            BlockManagerId of the shuffle map output (i.e. the BlockManager where a ShuffleMapTask ran and the result is stored)

                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                            • ShuffleStatus is requested to removeMapOutput and removeOutputsByFilter
                                                                                                                                                                                                                            • MapOutputTrackerMaster is requested for locations with the largest number of shuffle map outputs and getMapLocation
                                                                                                                                                                                                                            • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                            • DAGScheduler is requested to handle a ShuffleMapTask completion
                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapStatus/#map-id","title":"Map Id
                                                                                                                                                                                                                            mapId: Long\n

                                                                                                                                                                                                                            Map Id of the shuffle map output

                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                            • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapStatus/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                            • CompressedMapStatus
                                                                                                                                                                                                                            • HighlyCompressedMapStatus
                                                                                                                                                                                                                            Sealed Trait

                                                                                                                                                                                                                            MapStatus is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                            "},{"location":"scheduler/MapStatus/#sparkshuffleminnumpartitionstohighlycompress","title":"spark.shuffle.minNumPartitionsToHighlyCompress

                                                                                                                                                                                                                            MapStatus utility uses spark.shuffle.minNumPartitionsToHighlyCompress internal configuration property for the minimum number of partitions to prefer a HighlyCompressedMapStatus.

                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/MapStatus/#creating-mapstatus","title":"Creating MapStatus
                                                                                                                                                                                                                            apply(\n  loc: BlockManagerId,\n  uncompressedSizes: Array[Long],\n  mapTaskId: Long): MapStatus\n

apply creates a HighlyCompressedMapStatus when the number of uncompressedSizes is above the minPartitionsToUseHighlyCompressMapStatus threshold. Otherwise, apply creates a CompressedMapStatus.

                                                                                                                                                                                                                            apply is used when:

                                                                                                                                                                                                                            • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                            • BypassMergeSortShuffleWriter is requested to write records
                                                                                                                                                                                                                            • UnsafeShuffleWriter is requested to close resources and write out merged spill files
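A hedged sketch of the decision (essentially what the description above says, not presented as the exact source) could look as follows, assuming minPartitionsToUseHighlyCompressMapStatus holds the value of spark.shuffle.minNumPartitionsToHighlyCompress:

```scala
def apply(
    loc: BlockManagerId,
    uncompressedSizes: Array[Long],
    mapTaskId: Long): MapStatus = {
  // Many partitions => favor the compact, approximate HighlyCompressedMapStatus;
  // otherwise keep per-block sizes in a CompressedMapStatus.
  if (uncompressedSizes.length > minPartitionsToUseHighlyCompressMapStatus) {
    HighlyCompressedMapStatus(loc, uncompressedSizes, mapTaskId)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes, mapTaskId)
  }
}
```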
                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/Pool/","title":"Pool","text":"

Schedulable Pool

Pool is a Schedulable entity that represents a tree of TaskSetManagers, i.e. it contains a collection of TaskSetManagers or the Pools thereof.

A Pool has a mandatory name, a scheduling mode, an initial minShare, and a weight, all defined when it is created.

NOTE: An instance of Pool is created when TaskSchedulerImpl is initialized.

NOTE: The TaskScheduler Contract and Schedulable Contract both require that their entities have rootPool of type Pool.

increaseRunningTasks Method

CAUTION: FIXME

decreaseRunningTasks Method

CAUTION: FIXME

taskSetSchedulingAlgorithm Attribute

Using the scheduling mode (given when a Pool object is created), Pool selects a SchedulingAlgorithm and sets taskSetSchedulingAlgorithm:

• FIFOSchedulingAlgorithm for FIFO scheduling mode.
• FairSchedulingAlgorithm for FAIR scheduling mode.

It throws an IllegalArgumentException when an unsupported scheduling mode is passed in:

Unsupported spark.scheduler.mode: [schedulingMode]\n

TIP: Read about the scheduling modes in SchedulingMode.

NOTE: taskSetSchedulingAlgorithm is used in getSortedTaskSetQueue.

Getting TaskSetManagers Sorted -- getSortedTaskSetQueue Method

NOTE: getSortedTaskSetQueue is part of the Schedulable Contract.

getSortedTaskSetQueue sorts all the Schedulables in the schedulableQueue by a SchedulingAlgorithm (the internal taskSetSchedulingAlgorithm), as sketched below.

NOTE: It is called when TaskSchedulerImpl processes executor resource offers.
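A hedged sketch of the sorting (not the exact source) follows; it assumes schedulableQueue is a java.util.concurrent.ConcurrentLinkedQueue[Schedulable] and uses the comparator of the taskSetSchedulingAlgorithm described above.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  val sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
  // Sort the direct children (TaskSetManagers or sub-pools) with the pool's algorithm...
  val sortedSchedulableQueue =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  // ...and recursively collect their sorted TaskSetManagers.
  for (schedulable <- sortedSchedulableQueue) {
    sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
  }
  sortedTaskSetQueue
}
```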

Schedulables by Name -- schedulableNameToSchedulable Registry

                                                                                                                                                                                                                              "},{"location":"scheduler/Pool/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/Pool/#schedulablenametoschedulable-new-concurrenthashmapstring-schedulable","title":"schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]","text":"

schedulableNameToSchedulable is a lookup table of Schedulable objects by their names.

Besides the obvious usage in the housekeeping methods addSchedulable, removeSchedulable, and getSchedulableByName from the Schedulable Contract, it is only used in SparkContext.getPoolForName.

addSchedulable Method

NOTE: addSchedulable is part of the Schedulable Contract.

addSchedulable adds a Schedulable to the schedulableQueue and the schedulableNameToSchedulable registry.

More importantly, it sets the Schedulable entity's parent to itself.

removeSchedulable Method

NOTE: removeSchedulable is part of the Schedulable Contract.

removeSchedulable removes a Schedulable from the schedulableQueue and the schedulableNameToSchedulable registry.

NOTE: removeSchedulable is the opposite of the addSchedulable method.

SchedulingAlgorithm

SchedulingAlgorithm is the interface for a sorting algorithm to sort Schedulables.

There are currently two SchedulingAlgorithms:

• FIFOSchedulingAlgorithm for FIFO scheduling mode.
• FairSchedulingAlgorithm for FAIR scheduling mode.

FIFOSchedulingAlgorithm

FIFOSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their priority first and, when equal, by their stageId.

NOTE: priority and stageId are part of the Schedulable Contract.

                                                                                                                                                                                                                                CAUTION: FIXME A picture is worth a thousand words. How to picture the algorithm?
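In the meantime, a hedged comparator sketch (logically equivalent to the description above, not the exact source) may help:

```scala
class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    // Lower priority value (i.e. an earlier job) goes first...
    val priorityDiff = s1.priority - s2.priority
    if (priorityDiff != 0) {
      priorityDiff < 0
    } else {
      // ...and ties are broken by the lower stageId.
      s1.stageId < s2.stageId
    }
  }
}
```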

FairSchedulingAlgorithm

FairSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their minShare, runningTasks, and weight.

NOTE: minShare, runningTasks, and weight are part of the Schedulable Contract.

Figure: FairSchedulingAlgorithm (spark-pool-FairSchedulingAlgorithm.png)

For each input Schedulable, minShareRatio is computed as runningTasks divided by minShare (with minShare treated as at least 1), while taskToWeightRatio is runningTasks divided by weight.
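A hedged sketch of the comparison (simplified; the real algorithm also breaks remaining ties, e.g. by name) is shown below:

```scala
class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    // A Schedulable is "needy" while it runs fewer tasks than its minShare.
    val needy1 = s1.runningTasks < s1.minShare
    val needy2 = s2.runningTasks < s2.minShare
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight

    if (needy1 && !needy2) true            // needy Schedulables go first
    else if (!needy1 && needy2) false
    else if (needy1 && needy2) minShareRatio1 < minShareRatio2
    else taskToWeightRatio1 < taskToWeightRatio2
  }
}
```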

Finding Schedulable by Name -- getSchedulableByName Method

                                                                                                                                                                                                                                "},{"location":"scheduler/Pool/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/Pool/#getschedulablebynameschedulablename-string-schedulable","title":"getSchedulableByName(schedulableName: String): Schedulable","text":"

NOTE: getSchedulableByName is part of the Schedulable Contract to find a Schedulable by name.

                                                                                                                                                                                                                                getSchedulableByName...FIXME

                                                                                                                                                                                                                                "},{"location":"scheduler/ResultStage/","title":"ResultStage","text":"

ResultStage is the final stage in a job that applies a function to one or more partitions of the target RDD to compute the result of an action.

                                                                                                                                                                                                                                The partitions are given as a collection of partition ids (partitions) and the function func: (TaskContext, Iterator[_]) => _.

Finding Missing Partitions

                                                                                                                                                                                                                                "},{"location":"scheduler/ResultStage/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/ResultStage/#findmissingpartitions-seqint","title":"findMissingPartitions(): Seq[Int]","text":"

NOTE: findMissingPartitions is part of the Stage abstraction.

                                                                                                                                                                                                                                findMissingPartitions...FIXME

Figure: ResultStage.findMissingPartitions and ActiveJob (resultstage-findMissingPartitions.png)

                                                                                                                                                                                                                                In the above figure, partitions 1 and 2 are not finished (F is false while T is true).
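In the figure's terms, a hedged sketch of the computation (not the exact source): the missing partitions are simply the partition ids the active job has not yet marked as finished.

```scala
override def findMissingPartitions(): Seq[Int] = {
  val job = activeJob.get  // a ResultStage is expected to have an ActiveJob here
  (0 until job.numPartitions).filter(id => !job.finished(id))
}
```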

func Property

CAUTION: FIXME

setActiveJob Method

CAUTION: FIXME

removeActiveJob Method

CAUTION: FIXME

activeJob Method

                                                                                                                                                                                                                                "},{"location":"scheduler/ResultStage/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/ResultStage/#activejob-optionactivejob","title":"activeJob: Option[ActiveJob]","text":"

                                                                                                                                                                                                                                activeJob returns the optional ActiveJob associated with a ResultStage.

                                                                                                                                                                                                                                CAUTION: FIXME When/why would that be NONE (empty)?

                                                                                                                                                                                                                                "},{"location":"scheduler/ResultTask/","title":"ResultTask","text":"

                                                                                                                                                                                                                                ResultTask[T, U] is a Task that executes a partition processing function on a partition with records (of type T) to produce a result (of type U) that is sent back to the driver.

                                                                                                                                                                                                                                T -- [ResultTask] --> U\n
                                                                                                                                                                                                                                "},{"location":"scheduler/ResultTask/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                ResultTask takes the following to be created:

                                                                                                                                                                                                                                • Stage ID
                                                                                                                                                                                                                                • Stage Attempt ID
                                                                                                                                                                                                                                • Broadcast variable with a serialized task (Broadcast[Array[Byte]])
                                                                                                                                                                                                                                • Partition to compute
                                                                                                                                                                                                                                • TaskLocation
                                                                                                                                                                                                                                • Output ID
                                                                                                                                                                                                                                • Local Properties
                                                                                                                                                                                                                                • Serialized TaskMetrics (Array[Byte])
                                                                                                                                                                                                                                • ActiveJob ID (optional)
                                                                                                                                                                                                                                • Application ID (optional)
                                                                                                                                                                                                                                • Application Attempt ID (optional)
                                                                                                                                                                                                                                • isBarrier flag (default: false)

ResultTask is created when:

                                                                                                                                                                                                                                  • DAGScheduler is requested to submit missing tasks of a ResultStage
                                                                                                                                                                                                                                  "},{"location":"scheduler/ResultTask/#running-task","title":"Running Task
                                                                                                                                                                                                                                  runTask(\n  context: TaskContext): U\n

runTask is part of the Task abstraction.

                                                                                                                                                                                                                                  runTask deserializes a RDD and a partition processing function from the broadcast variable (using the Closure Serializer).

                                                                                                                                                                                                                                  In the end, runTask executes the function (on the records from the partition of the RDD).
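A hedged sketch of runTask (simplified, not the exact source, which also records deserialization metrics) illustrates the two steps:

```scala
import java.nio.ByteBuffer

override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the partition processing function from the
  // broadcast task binary using the closure Serializer.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  // Apply the function to the records of this task's partition.
  func(context, rdd.iterator(partition, context))
}
```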

                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/Schedulable/","title":"Schedulable","text":"

Schedulable Contract -- Schedulable Entities

                                                                                                                                                                                                                                  Schedulable is the <> of <> that manages the <> and can <>.

                                                                                                                                                                                                                                  [[contract]] .Schedulable Contract [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                                                                                  | addSchedulable a| [[addSchedulable]]

                                                                                                                                                                                                                                  "},{"location":"scheduler/Schedulable/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#addschedulableschedulable-schedulable-unit","title":"addSchedulable(schedulable: Schedulable): Unit","text":"

                                                                                                                                                                                                                                  Registers a <>

                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                  • FIFOSchedulableBuilder is requested to <>

                                                                                                                                                                                                                                  • FairSchedulableBuilder is requested to <>, <>, and <>

                                                                                                                                                                                                                                    | checkSpeculatableTasks a| [[checkSpeculatableTasks]]

                                                                                                                                                                                                                                    "},{"location":"scheduler/Schedulable/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#checkspeculatabletasksmintimetospeculation-int-boolean","title":"checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean","text":"

                                                                                                                                                                                                                                    Used when...FIXME

                                                                                                                                                                                                                                    | executorLost a| [[executorLost]]

                                                                                                                                                                                                                                    "},{"location":"scheduler/Schedulable/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                    executorLost( executorId: String, host: String, reason: ExecutorLossReason): Unit

                                                                                                                                                                                                                                    Handles an executor lost event

                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                    • Pool is requested to <>

                                                                                                                                                                                                                                    • TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#removeExecutor[removeExecutor]

| getSchedulableByName a| [[getSchedulableByName]]

                                                                                                                                                                                                                                      "},{"location":"scheduler/Schedulable/#source-scala_3","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#getschedulablebynamename-string-schedulable","title":"getSchedulableByName(name: String): Schedulable","text":"

                                                                                                                                                                                                                                      Finds a <> by name

                                                                                                                                                                                                                                      Used when...FIXME

                                                                                                                                                                                                                                      | getSortedTaskSetQueue a| [[getSortedTaskSetQueue]]

                                                                                                                                                                                                                                      "},{"location":"scheduler/Schedulable/#source-scala_4","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#getsortedtasksetqueue-arraybuffertasksetmanager","title":"getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]","text":"

                                                                                                                                                                                                                                      Builds a collection of scheduler:TaskSetManager.md[TaskSetManagers] sorted by <>

                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                      • Pool is requested to <> (recursively)

                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#resourceOffers[resourceOffers]

| minShare a| [[minShare]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_5","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#minshare-int","title":"minShare: Int","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | name a| [[name]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_6","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#name-string","title":"name: String","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | parent a| [[parent]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_7","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#parent-pool","title":"parent: Pool","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | priority a| [[priority]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_8","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#priority-int","title":"priority: Int","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | removeSchedulable a| [[removeSchedulable]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_9","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#removeschedulableschedulable-schedulable-unit","title":"removeSchedulable(schedulable: Schedulable): Unit","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | runningTasks a| [[runningTasks]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_10","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#runningtasks-int","title":"runningTasks: Int","text":"

                                                                                                                                                                                                                                        Used when...FIXME

                                                                                                                                                                                                                                        | schedulableQueue a| [[schedulableQueue]]

                                                                                                                                                                                                                                        "},{"location":"scheduler/Schedulable/#source-scala_11","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#schedulablequeue-concurrentlinkedqueueschedulable","title":"schedulableQueue: ConcurrentLinkedQueue[Schedulable]","text":"

                                                                                                                                                                                                                                        Queue of <> (as https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html[ConcurrentLinkedQueue])

                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                        • SparkContext is requested to SparkContext.md#getAllPools[getAllPools]

                                                                                                                                                                                                                                        • Pool is requested to <>, <>, <>, <>, <>, and <>

                                                                                                                                                                                                                                          | schedulingMode a| [[schedulingMode]]

                                                                                                                                                                                                                                          "},{"location":"scheduler/Schedulable/#source-scala_12","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#schedulingmode-schedulingmode","title":"schedulingMode: SchedulingMode","text":"

                                                                                                                                                                                                                                          <>

                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                          • Pool is <>

                                                                                                                                                                                                                                          • web UI's PoolTable is requested to render a page with pools (poolRow)

| stageId a| [[stageId]]

                                                                                                                                                                                                                                            "},{"location":"scheduler/Schedulable/#source-scala_13","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#stageid-int","title":"stageId: Int","text":"

                                                                                                                                                                                                                                            Used when...FIXME

                                                                                                                                                                                                                                            | weight a| [[weight]]

                                                                                                                                                                                                                                            "},{"location":"scheduler/Schedulable/#source-scala_14","title":"[source, scala]","text":""},{"location":"scheduler/Schedulable/#weight-int","title":"weight: Int","text":"

                                                                                                                                                                                                                                            Used when...FIXME

                                                                                                                                                                                                                                            |===
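Collected from the contract above, a Schedulable entity boils down to a Scala trait along these lines (a paraphrase for orientation only; the real trait is private[spark], so the sketch assumes it lives in the org.apache.spark.scheduler package):

```scala
package org.apache.spark.scheduler

import java.util.concurrent.ConcurrentLinkedQueue

import scala.collection.mutable.ArrayBuffer

// Paraphrase of the Schedulable contract listed above (not the exact Spark source)
private[spark] trait Schedulable {
  // position in the pool hierarchy
  def parent: Pool
  def schedulableQueue: ConcurrentLinkedQueue[Schedulable]
  def schedulingMode: SchedulingMode.SchedulingMode
  // attributes used by the scheduling algorithms (FIFO, FAIR)
  def weight: Int
  def minShare: Int
  def runningTasks: Int
  def priority: Int
  def stageId: Int
  def name: String
  // managing the hierarchy, lost executors and speculatable tasks
  def addSchedulable(schedulable: Schedulable): Unit
  def removeSchedulable(schedulable: Schedulable): Unit
  def getSchedulableByName(name: String): Schedulable
  def executorLost(executorId: String, host: String, reason: ExecutorLossReason): Unit
  def checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean
  def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]
}
```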

                                                                                                                                                                                                                                            [[implementations]] .Schedulables [cols=\"1,3\",options=\"header\",width=\"100%\"] |=== | Schedulable | Description

                                                                                                                                                                                                                                            | <> | [[Pool]] Pool of <> (i.e. a recursive data structure for prioritizing task sets)

                                                                                                                                                                                                                                            | scheduler:TaskSetManager.md[TaskSetManager] | [[TaskSetManager]] Manages scheduling of tasks of a scheduler:TaskSet.md[TaskSet]

                                                                                                                                                                                                                                            |===

                                                                                                                                                                                                                                            "},{"location":"scheduler/SchedulableBuilder/","title":"SchedulableBuilder","text":"

                                                                                                                                                                                                                                            == [[SchedulableBuilder]] SchedulableBuilder Contract -- Builders of Schedulable Pools

                                                                                                                                                                                                                                            SchedulableBuilder is the <> of <> that manage a <>, which is to <> and <>.

                                                                                                                                                                                                                                            SchedulableBuilder is a private[spark] Scala trait that is used exclusively by scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] (the default Spark scheduler). When requested to scheduler:TaskSchedulerImpl.md#initialize[initialize], TaskSchedulerImpl uses the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property (default: FIFO) to select one of the <>.

                                                                                                                                                                                                                                            [[contract]] .SchedulableBuilder Contract [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                                                                                            | addTaskSetManager a| [[addTaskSetManager]]

                                                                                                                                                                                                                                            "},{"location":"scheduler/SchedulableBuilder/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/SchedulableBuilder/#addtasksetmanagermanager-schedulable-properties-properties-unit","title":"addTaskSetManager(manager: Schedulable, properties: Properties): Unit","text":"

                                                                                                                                                                                                                                            Registers a new <> with the <>

                                                                                                                                                                                                                                            Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#submitTasks[submit tasks (of TaskSet) for execution] (and registers a new scheduler:TaskSetManager.md[TaskSetManager] for the TaskSet)

                                                                                                                                                                                                                                            | buildPools a| [[buildPools]]

                                                                                                                                                                                                                                            "},{"location":"scheduler/SchedulableBuilder/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/SchedulableBuilder/#buildpools-unit","title":"buildPools(): Unit","text":"

                                                                                                                                                                                                                                            Builds a tree of <>

                                                                                                                                                                                                                                            Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize] (and creates a scheduler:TaskSchedulerImpl.md#schedulableBuilder[SchedulableBuilder] per configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property)

                                                                                                                                                                                                                                            | rootPool a| [[rootPool]]

                                                                                                                                                                                                                                            "},{"location":"scheduler/SchedulableBuilder/#source-scala_2","title":"[source, scala]","text":""},{"location":"scheduler/SchedulableBuilder/#rootpool-pool","title":"rootPool: Pool","text":"

                                                                                                                                                                                                                                            Root (top-level) <>

                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                            • FIFOSchedulableBuilder is requested to <>

                                                                                                                                                                                                                                            • FairSchedulableBuilder is requested to <>, <>, and <>

                                                                                                                                                                                                                                              |===

                                                                                                                                                                                                                                              [[implementations]] .SchedulableBuilders [cols=\"1,3\",options=\"header\",width=\"100%\"] |=== | SchedulableBuilder | Description

                                                                                                                                                                                                                                              | <> | [[FairSchedulableBuilder]] Used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FAIR

                                                                                                                                                                                                                                              | <> | [[FIFOSchedulableBuilder]] Default SchedulableBuilder that is used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FIFO (default)

                                                                                                                                                                                                                                              |===
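For orientation, the three methods above amount to a trait along these lines (a paraphrase; the real trait is private[spark] in org.apache.spark.scheduler):

```scala
package org.apache.spark.scheduler

import java.util.Properties

// Paraphrase of the SchedulableBuilder contract (not the exact Spark source)
private[spark] trait SchedulableBuilder {
  // the root (top-level) pool the builder manages
  def rootPool: Pool
  // builds the tree of Schedulable pools
  def buildPools(): Unit
  // registers a new TaskSetManager (a Schedulable) with the pool hierarchy
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}
```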

                                                                                                                                                                                                                                              "},{"location":"scheduler/SchedulerBackend/","title":"SchedulerBackend","text":"

                                                                                                                                                                                                                                              SchedulerBackend is an abstraction of task scheduling backends that can revive resource offers from cluster managers.

The SchedulerBackend abstraction allows TaskSchedulerImpl to use a variety of cluster managers (each with its own resource offers and task scheduling modes).

                                                                                                                                                                                                                                              Note

Being a scheduler backend assumes an Apache Mesos-like scheduling model in which \"an application\" gets resource offers as machines become available, so it is possible to launch tasks on them. Once the required resource allocation is obtained, the scheduler backend can start executors.

                                                                                                                                                                                                                                              "},{"location":"scheduler/SchedulerBackend/#contract","title":"Contract","text":""},{"location":"scheduler/SchedulerBackend/#applicationattemptid","title":"applicationAttemptId
                                                                                                                                                                                                                                              applicationAttemptId(): Option[String]\n

                                                                                                                                                                                                                                              Execution attempt ID of this Spark application

                                                                                                                                                                                                                                              Default: None (undefined)

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested for the execution attempt ID of a Spark application
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#applicationid","title":"applicationId
                                                                                                                                                                                                                                              applicationId(): String\n

                                                                                                                                                                                                                                              Unique identifier of this Spark application

                                                                                                                                                                                                                                              Default: spark-application-[currentTimeMillis]

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested for the unique identifier of a Spark application
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#default-parallelism","title":"Default Parallelism
                                                                                                                                                                                                                                              defaultParallelism(): Int\n

                                                                                                                                                                                                                                              Default parallelism, i.e. a hint for the number of tasks in stages while sizing jobs

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested for the default parallelism
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#getdriverattributes","title":"getDriverAttributes
                                                                                                                                                                                                                                              getDriverAttributes: Option[Map[String, String]]\n

                                                                                                                                                                                                                                              Default: None

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • SparkContext is requested to postApplicationStart
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#getdriverlogurls","title":"getDriverLogUrls
                                                                                                                                                                                                                                              getDriverLogUrls: Option[Map[String, String]]\n

                                                                                                                                                                                                                                              Driver log URLs

                                                                                                                                                                                                                                              Default: None (undefined)

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • SparkContext is requested to postApplicationStart
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#isready","title":"isReady
                                                                                                                                                                                                                                              isReady(): Boolean\n

                                                                                                                                                                                                                                              Controls whether this SchedulerBackend is ready (true) or not (false)

                                                                                                                                                                                                                                              Default: true

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested to wait until scheduling backend is ready
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#killing-task","title":"Killing Task
                                                                                                                                                                                                                                              killTask(\n  taskId: Long,\n  executorId: String,\n  interruptThread: Boolean,\n  reason: String): Unit\n

                                                                                                                                                                                                                                              Kills a given task

                                                                                                                                                                                                                                              Default: UnsupportedOperationException

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested to killTaskAttempt and killAllTaskAttempts
                                                                                                                                                                                                                                              • TaskSetManager is requested to handle a successful task attempt
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks
                                                                                                                                                                                                                                              maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                                                                                                                                                              Maximum number of concurrent tasks that can be launched (based on the given ResourceProfile)

                                                                                                                                                                                                                                              See:

                                                                                                                                                                                                                                              • CoarseGrainedSchedulerBackend
                                                                                                                                                                                                                                              • LocalSchedulerBackend

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • SparkContext is requested for the maximum number of concurrent tasks
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#reviveoffers","title":"reviveOffers
                                                                                                                                                                                                                                              reviveOffers(): Unit\n

                                                                                                                                                                                                                                              Handles resource allocation offers (from the scheduling system)

                                                                                                                                                                                                                                              Used when TaskSchedulerImpl is requested to:

                                                                                                                                                                                                                                              • Submit tasks (from a TaskSet)

                                                                                                                                                                                                                                              • Handle a task status update

                                                                                                                                                                                                                                              • Notify the TaskSetManager that a task has failed

                                                                                                                                                                                                                                              • Check for speculatable tasks

                                                                                                                                                                                                                                              • Handle a lost executor event

                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#starting-schedulerbackend","title":"Starting SchedulerBackend
                                                                                                                                                                                                                                              start(): Unit\n

                                                                                                                                                                                                                                              Starts this SchedulerBackend

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested to start
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#stop","title":"stop
                                                                                                                                                                                                                                              stop(): Unit\n

                                                                                                                                                                                                                                              Stops this SchedulerBackend

                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested to stop
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackend/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                              • CoarseGrainedSchedulerBackend
                                                                                                                                                                                                                                              • LocalSchedulerBackend
                                                                                                                                                                                                                                              • MesosFineGrainedSchedulerBackend
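Putting the contract together, the implementations above extend a trait along the following lines (a paraphrase that folds in the defaults listed in the Contract section; the real trait is private[spark] in org.apache.spark.scheduler):

```scala
package org.apache.spark.scheduler

import org.apache.spark.resource.ResourceProfile

// Paraphrase of the SchedulerBackend contract with the defaults described above
// (not the exact Spark source)
private[spark] trait SchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int
  def maxNumConcurrentTasks(rp: ResourceProfile): Int

  // Operations with the defaults listed in the Contract section
  def killTask(
      taskId: Long,
      executorId: String,
      interruptThread: Boolean,
      reason: String): Unit =
    throw new UnsupportedOperationException
  def isReady(): Boolean = true
  def applicationId(): String = s"spark-application-${System.currentTimeMillis}"
  def applicationAttemptId(): Option[String] = None
  def getDriverLogUrls: Option[Map[String, String]] = None
  def getDriverAttributes: Option[Map[String, String]] = None
}
```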
                                                                                                                                                                                                                                              "},{"location":"scheduler/SchedulerBackendUtils/","title":"SchedulerBackendUtils Utility","text":""},{"location":"scheduler/SchedulerBackendUtils/#default-number-of-executors","title":"Default Number of Executors

SchedulerBackendUtils uses 2 as the default number of executors (DEFAULT_NUMBER_EXECUTORS).

                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulerBackendUtils/#getinitialtargetexecutornumber","title":"getInitialTargetExecutorNumber
                                                                                                                                                                                                                                              getInitialTargetExecutorNumber(\n  conf: SparkConf,\n  numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int\n

                                                                                                                                                                                                                                              getInitialTargetExecutorNumber branches off based on whether Dynamic Allocation of Executors is enabled or not.

With Dynamic Allocation of Executors disabled, getInitialTargetExecutorNumber uses the spark.executor.instances configuration property (if defined) or falls back to the given numExecutors (which defaults to DEFAULT_NUMBER_EXECUTORS).

With Dynamic Allocation of Executors enabled, getInitialTargetExecutorNumber uses getDynamicAllocationInitialExecutors and makes sure that the value is between the following configuration properties (see the sketch after this list):

                                                                                                                                                                                                                                              • spark.dynamicAllocation.minExecutors
                                                                                                                                                                                                                                              • spark.dynamicAllocation.maxExecutors
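A sketch of that branching, reading the configuration properties directly rather than through getDynamicAllocationInitialExecutors (the property defaults used below are assumptions for illustration, not the exact Spark source):

```scala
import org.apache.spark.SparkConf

// Sketch of the getInitialTargetExecutorNumber logic described above
val DEFAULT_NUMBER_EXECUTORS = 2

def getInitialTargetExecutorNumber(
    conf: SparkConf,
    numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int = {
  if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
    val minExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
    val initialExecutors = conf.getInt("spark.dynamicAllocation.initialExecutors", minExecutors)
    val maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
    // make sure the initial value stays between minExecutors and maxExecutors
    require(initialExecutors >= minExecutors && initialExecutors <= maxExecutors,
      s"initial executor number $initialExecutors must be between $minExecutors and $maxExecutors")
    initialExecutors
  } else {
    // no dynamic allocation: spark.executor.instances if defined, the given numExecutors otherwise
    conf.getInt("spark.executor.instances", numExecutors)
  }
}
```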

                                                                                                                                                                                                                                              getInitialTargetExecutorNumber is used when:

                                                                                                                                                                                                                                              • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                                                                                                                                                                                                                                              • Spark on YARN's YarnAllocator, YarnClientSchedulerBackend and YarnClusterSchedulerBackend are used
                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/SchedulingMode/","title":"SchedulingMode","text":"

                                                                                                                                                                                                                                              == [[SchedulingMode]] Scheduling Mode -- spark.scheduler.mode Spark Property

Scheduling Mode (aka task order policy, scheduling policy or scheduling order) defines a policy to sort tasks for execution.

The schedulingMode attribute is part of the TaskScheduler contract.

TaskSchedulerImpl, the only implementation of the TaskScheduler contract in Spark, uses the spark.scheduler.mode configuration property to configure schedulingMode, which is merely used to set up the rootPool attribute (with FIFO being the default). That happens when TaskSchedulerImpl is initialized.

                                                                                                                                                                                                                                              There are three acceptable scheduling modes:

• FIFO with no pools but a single top-level unnamed pool whose elements are TaskSetManager objects; a lower priority (i.e. an earlier stage) gets scheduled sooner.
• FAIR with a hierarchy of Schedulable (sub)pools with the rootPool at the top.
• NONE (not used)

NOTE: Out of the three possible SchedulingMode policies, only FIFO and FAIR are supported by TaskSchedulerImpl.
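For example, FAIR scheduling can be requested when building the SparkConf (a minimal sketch; the application name is hypothetical):

import org.apache.spark.SparkConf

// Switch the task scheduling policy from the default FIFO to FAIR.
val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")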

                                                                                                                                                                                                                                              "},{"location":"scheduler/SchedulingMode/#note","title":"[NOTE]","text":"

After the root pool is initialized, the scheduling mode is no longer relevant (since the Schedulable that represents the root pool is fully set up).

                                                                                                                                                                                                                                              "},{"location":"scheduler/SchedulingMode/#the-root-pool-is-later-used-when-schedulertaskschedulerimplmdsubmittaskstaskschedulerimpl-submits-tasks-as-tasksets-for-execution","title":"The root pool is later used when scheduler:TaskSchedulerImpl.md#submitTasks[TaskSchedulerImpl submits tasks (as TaskSets) for execution].","text":"

NOTE: The root pool is a Schedulable. Refer to Schedulable.

Monitoring FAIR Scheduling Mode using Spark UI

                                                                                                                                                                                                                                              CAUTION: FIXME Describe me...

                                                                                                                                                                                                                                              "},{"location":"scheduler/ShuffleMapStage/","title":"ShuffleMapStage","text":"

                                                                                                                                                                                                                                              ShuffleMapStage (shuffle map stage or simply map stage) is a Stage.

                                                                                                                                                                                                                                              ShuffleMapStage corresponds to (and is associated with) a ShuffleDependency.

                                                                                                                                                                                                                                              ShuffleMapStage can be submitted independently but it is usually an intermediate step in a physical execution plan (with the final step being a ResultStage).

                                                                                                                                                                                                                                              "},{"location":"scheduler/ShuffleMapStage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                              ShuffleMapStage takes the following to be created:

                                                                                                                                                                                                                                              • Stage ID
                                                                                                                                                                                                                                              • RDD (of the ShuffleDependency)
                                                                                                                                                                                                                                              • Number of tasks
                                                                                                                                                                                                                                              • Parent Stages
                                                                                                                                                                                                                                              • First Job ID (of the ActiveJob that created it)
                                                                                                                                                                                                                                              • CallSite
                                                                                                                                                                                                                                              • ShuffleDependency
                                                                                                                                                                                                                                              • MapOutputTrackerMaster
                                                                                                                                                                                                                                              • Resource Profile ID

                                                                                                                                                                                                                                                ShuffleMapStage is created when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to plan a ShuffleDependency for execution
                                                                                                                                                                                                                                                "},{"location":"scheduler/ShuffleMapStage/#missing-partitions","title":"Missing Partitions
                                                                                                                                                                                                                                                findMissingPartitions(): Seq[Int]\n

                                                                                                                                                                                                                                                findMissingPartitions requests the MapOutputTrackerMaster for the missing partitions (of the ShuffleDependency) and returns them.

                                                                                                                                                                                                                                                If not available (MapOutputTrackerMaster does not track the ShuffleDependency), findMissingPartitions simply assumes that all the partitions are missing.
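A sketch of that lookup-with-fallback (hedged; names like mapOutputTrackerMaster, shuffleDep and numPartitions follow the properties described on this page, not necessarily the exact source):

// If the tracker does not know the shuffle, assume every partition is missing.
def findMissingPartitions(): Seq[Int] =
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)  // Option[Seq[Int]]
    .getOrElse(0 until numPartitions)             // all partitions missing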

                                                                                                                                                                                                                                                findMissingPartitions is part of the Stage abstraction.

                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#shufflemapstage-ready","title":"ShuffleMapStage Ready

                                                                                                                                                                                                                                                When \"executed\", a ShuffleMapStage saves map output files (for reduce tasks).

                                                                                                                                                                                                                                                When all partitions have shuffle map outputs available, ShuffleMapStage is considered ready (done or available).

                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#isavailable","title":"isAvailable
                                                                                                                                                                                                                                                isAvailable: Boolean\n

isAvailable is true when the ShuffleMapStage is ready, i.e. all partitions have shuffle map outputs available (the numAvailableOutputs is exactly the numPartitions).
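In effect, the check is a comparison of the two counters (a one-line sketch):

def isAvailable: Boolean = numAvailableOutputs == numPartitions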

                                                                                                                                                                                                                                                isAvailable is used when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to getMissingParentStages, handleMapStageSubmitted, submitMissingTasks, processShuffleMapStageCompletion, markMapStageJobsAsFinished and stageDependsOn
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#available-outputs","title":"Available Outputs
                                                                                                                                                                                                                                                numAvailableOutputs: Int\n

                                                                                                                                                                                                                                                numAvailableOutputs requests the MapOutputTrackerMaster to getNumAvailableOutputs (for the shuffleId of the ShuffleDependency).
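A one-line sketch of the delegation (mapOutputTrackerMaster and shuffleDep refer to the constructor arguments listed above):

def numAvailableOutputs: Int =
  mapOutputTrackerMaster.getNumAvailableOutputs(shuffleDep.shuffleId)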

                                                                                                                                                                                                                                                numAvailableOutputs is used when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to submitMissingTasks
                                                                                                                                                                                                                                                • ShuffleMapStage is requested to isAvailable
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#active-jobs","title":"Active Jobs

                                                                                                                                                                                                                                                ShuffleMapStage defines _mapStageJobs internal registry of ActiveJobs to track jobs that were submitted to execute the stage independently.

                                                                                                                                                                                                                                                A new job is registered (added) in addActiveJob.

                                                                                                                                                                                                                                                An active job is deregistered (removed) in removeActiveJob.

                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#addactivejob","title":"addActiveJob
                                                                                                                                                                                                                                                addActiveJob(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                                                                addActiveJob adds the given ActiveJob to (the front of) the _mapStageJobs list.
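Prepending keeps the most recently submitted job at the front of the list (a one-line sketch of the registration):

_mapStageJobs = job :: _mapStageJobs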

                                                                                                                                                                                                                                                addActiveJob is used when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to handleMapStageSubmitted
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#removeactivejob","title":"removeActiveJob
                                                                                                                                                                                                                                                removeActiveJob(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                                                                removeActiveJob removes the ActiveJob from the _mapStageJobs registry.

                                                                                                                                                                                                                                                removeActiveJob is used when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to cleanupStateForJobAndIndependentStages
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#mapstagejobs","title":"mapStageJobs
                                                                                                                                                                                                                                                mapStageJobs: Seq[ActiveJob]\n

                                                                                                                                                                                                                                                mapStageJobs returns the _mapStageJobs list.

                                                                                                                                                                                                                                                mapStageJobs is used when:

                                                                                                                                                                                                                                                • DAGScheduler is requested to markMapStageJobsAsFinished
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#demo-shufflemapstage-sharing","title":"Demo: ShuffleMapStage Sharing

                                                                                                                                                                                                                                                A ShuffleMapStage can be shared across multiple jobs (if these jobs reuse the same RDDs).

                                                                                                                                                                                                                                                val keyValuePairs = sc.parallelize(0 to 5).map((_, 1))\nval rdd = keyValuePairs.sortByKey()  // (1)\n\nscala> println(rdd.toDebugString)\n(6) ShuffledRDD[4] at sortByKey at <console>:39 []\n +-(16) MapPartitionsRDD[1] at map at <console>:39 []\n    |   ParallelCollectionRDD[0] at parallelize at <console>:39 []\n\nrdd.count  // (2)\nrdd.count  // (3)\n
                                                                                                                                                                                                                                                1. Shuffle at sortByKey()
2. Submits a job with two stages (both get executed)
3. Intentionally repeats the last action; it submits a new job with two stages, one of which is shared (already computed)
                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapStage/#map-output-files","title":"Map Output Files

                                                                                                                                                                                                                                                ShuffleMapStage writes out map output files (for a shuffle).

                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleMapTask/","title":"ShuffleMapTask","text":"

                                                                                                                                                                                                                                                ShuffleMapTask is a Task to produce a MapStatus (Task[MapStatus]).

ShuffleMapTask is one of the two types of Tasks. When executed, ShuffleMapTask writes the result of executing the serialized task code over the records (of an RDD partition) to the shuffle system and returns a MapStatus (with the BlockManager and the estimated size of the result shuffle blocks).

                                                                                                                                                                                                                                                "},{"location":"scheduler/ShuffleMapTask/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                ShuffleMapTask takes the following to be created:

                                                                                                                                                                                                                                                • Stage ID
                                                                                                                                                                                                                                                • Stage Attempt ID
                                                                                                                                                                                                                                                • Broadcast variable with a serialized task binary
                                                                                                                                                                                                                                                • Partition
                                                                                                                                                                                                                                                • TaskLocations
                                                                                                                                                                                                                                                • Local Properties
                                                                                                                                                                                                                                                • Serialized task metrics
                                                                                                                                                                                                                                                • Job ID (default: None)
                                                                                                                                                                                                                                                • Application ID (default: None)
                                                                                                                                                                                                                                                • Application Attempt ID (default: None)
                                                                                                                                                                                                                                                • isBarrier flag
ShuffleMapTask is created when:

• DAGScheduler is requested to submit tasks for all missing partitions of a ShuffleMapStage

                                                                                                                                                                                                                                                  "},{"location":"scheduler/ShuffleMapTask/#isBarrier","title":"isBarrier Flag","text":"

ShuffleMapTask can be given an isBarrier flag when created. Unless given, isBarrier is assumed disabled (false).

                                                                                                                                                                                                                                                  isBarrier flag is passed to the parent Task.

                                                                                                                                                                                                                                                  "},{"location":"scheduler/ShuffleMapTask/#serialized-task-binary","title":"Serialized Task Binary
                                                                                                                                                                                                                                                  taskBinary: Broadcast[Array[Byte]]\n

                                                                                                                                                                                                                                                  ShuffleMapTask is given a broadcast variable with a reference to a serialized task binary.

                                                                                                                                                                                                                                                  runTask expects that the serialized task binary is a tuple of an RDD and a ShuffleDependency.

                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/ShuffleMapTask/#preferred-locations","title":"Preferred Locations Signature
                                                                                                                                                                                                                                                  preferredLocations: Seq[TaskLocation]\n

                                                                                                                                                                                                                                                  preferredLocations is part of the Task abstraction.

                                                                                                                                                                                                                                                  preferredLocations returns preferredLocs internal property.

ShuffleMapTask tracks the TaskLocations as unique entries of the given locs (with the only rule that, when locs is not defined, preferredLocs is empty and no task location preferences are defined).

ShuffleMapTask initializes the preferredLocs internal property when created.
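A sketch of that rule (hedged; preferredLocs and locs as described above):

private val preferredLocs: Seq[TaskLocation] =
  if (locs == null) Nil else locs.distinct  // keep unique entries only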

                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/ShuffleMapTask/#running-task","title":"Running Task Signature
                                                                                                                                                                                                                                                  runTask(\n  context: TaskContext): MapStatus\n

                                                                                                                                                                                                                                                  runTask is part of the Task abstraction.

runTask writes the result of executing the serialized task code over the records of the RDD partition to the shuffle system and returns a MapStatus (with the BlockManager and an estimated size of the result shuffle blocks).

Internally, runTask requests the SparkEnv for a new instance of the closure serializer and requests it to deserialize the serialized task code (into a tuple of an RDD and a ShuffleDependency).

                                                                                                                                                                                                                                                  runTask measures the thread and CPU deserialization times.

                                                                                                                                                                                                                                                  runTask requests the SparkEnv for the ShuffleManager and requests it for a ShuffleWriter (for the ShuffleHandle and the partition).

                                                                                                                                                                                                                                                  runTask then requests the RDD for the records (of the partition) that the ShuffleWriter is requested to write out (to the shuffle system).

                                                                                                                                                                                                                                                  In the end, runTask requests the ShuffleWriter to stop (with the success flag on) and returns the shuffle map output status.

                                                                                                                                                                                                                                                  Note

This is the moment in a Task's lifecycle (and that of its corresponding RDD) when an RDD partition is computed and in turn becomes a sequence of records (i.e. real data) on an executor.

                                                                                                                                                                                                                                                  In case of any exceptions, runTask requests the ShuffleWriter to stop (with the success flag off) and (re)throws the exception.

                                                                                                                                                                                                                                                  runTask may also print out the following DEBUG message to the logs when the ShuffleWriter could not be stopped.

                                                                                                                                                                                                                                                  Could not stop writer\n
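The write-then-stop flow, including the failure path, can be sketched as follows (a conceptual sketch with hypothetical ShuffleWriterLike and MapStatusLike interfaces introduced only for illustration; the real ShuffleWriter API differs across Spark versions):

// Hypothetical minimal interfaces used only for this sketch.
trait MapStatusLike
trait ShuffleWriterLike[K, V] {
  def write(records: Iterator[Product2[K, V]]): Unit
  def stop(success: Boolean): Option[MapStatusLike]
}

// Write the partition's records to the shuffle system and return the MapStatus.
def writeAndStop[K, V](
    records: Iterator[Product2[K, V]],   // e.g. the records of the RDD partition
    writer: ShuffleWriterLike[K, V]): MapStatusLike = {
  try {
    writer.write(records)
    writer.stop(success = true).get       // the shuffle map output status
  } catch {
    case e: Throwable =>
      try writer.stop(success = false)    // best-effort cleanup
      catch { case _: Throwable => /* "Could not stop writer" at DEBUG */ }
      throw e
  }
}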
                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/ShuffleMapTask/#logging","title":"Logging

                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.scheduler.ShuffleMapTask logger to see what happens inside.

                                                                                                                                                                                                                                                  Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                  logger.ShuffleMapTask.name = org.apache.spark.scheduler.ShuffleMapTask\nlogger.ShuffleMapTask.level = all\n

                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/ShuffleStatus/","title":"ShuffleStatus","text":"

                                                                                                                                                                                                                                                  ShuffleStatus is a registry of MapStatuses per Partition of a ShuffleMapStage.

                                                                                                                                                                                                                                                  ShuffleStatus is used by MapOutputTrackerMaster.

                                                                                                                                                                                                                                                  "},{"location":"scheduler/ShuffleStatus/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                  ShuffleStatus takes the following to be created:

                                                                                                                                                                                                                                                  • Number of Partitions (of the RDD of the ShuffleDependency of a ShuffleMapStage)

                                                                                                                                                                                                                                                    ShuffleStatus is created\u00a0when:

                                                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested to register a shuffle (when DAGScheduler is requested to create a ShuffleMapStage)
                                                                                                                                                                                                                                                    "},{"location":"scheduler/ShuffleStatus/#mapstatuses-per-partition","title":"MapStatuses per Partition

                                                                                                                                                                                                                                                    ShuffleStatus creates a mapStatuses internal registry of MapStatuses per partition (using the numPartitions) when created.

A partition is missing when there is no MapStatus for it (null at the index of the partition ID); the missing partitions can be requested using findMissingPartitions.

                                                                                                                                                                                                                                                    mapStatuses is all null (for every partition) initially (and so all partitions are missing / uncomputed).

                                                                                                                                                                                                                                                    A new MapStatus is added in addMapOutput and updateMapOutput.

                                                                                                                                                                                                                                                    A MapStatus is removed (nulled) in removeMapOutput and removeOutputsByFilter.

                                                                                                                                                                                                                                                    The number of available MapStatuses is tracked by _numAvailableMapOutputs internal counter.
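The registry idea can be sketched as a standalone class (a simplified, hypothetical ShuffleStatusSketch; the real ShuffleStatus also handles locking, block manager addresses and serialized caches):

// One slot per partition; null means the map output has not been registered yet.
final class ShuffleStatusSketch(numPartitions: Int) {
  private val mapStatuses = new Array[AnyRef](numPartitions)
  private var _numAvailableMapOutputs = 0

  def addMapOutput(mapIndex: Int, status: AnyRef): Unit = {
    if (mapStatuses(mapIndex) == null) _numAvailableMapOutputs += 1
    mapStatuses(mapIndex) = status
  }

  def removeMapOutput(mapIndex: Int): Unit =
    if (mapStatuses(mapIndex) != null) {
      _numAvailableMapOutputs -= 1
      mapStatuses(mapIndex) = null
    }

  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filter(mapStatuses(_) == null)

  def numAvailableOutputs: Int = _numAvailableMapOutputs
}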

                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                    • serializedMapStatus and withMapStatuses
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ShuffleStatus/#registering-shuffle-map-output","title":"Registering Shuffle Map Output
                                                                                                                                                                                                                                                    addMapOutput(\n  mapIndex: Int,\n  status: MapStatus): Unit\n

                                                                                                                                                                                                                                                    addMapOutput adds the MapStatus to the mapStatuses internal registry.

                                                                                                                                                                                                                                                    In case the mapStatuses internal registry had no MapStatus for the mapIndex already available, addMapOutput increments the _numAvailableMapOutputs internal counter and invalidateSerializedMapOutputStatusCache.

                                                                                                                                                                                                                                                    addMapOutput\u00a0is used when:

                                                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested to registerMapOutput
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ShuffleStatus/#deregistering-shuffle-map-output","title":"Deregistering Shuffle Map Output
                                                                                                                                                                                                                                                    removeMapOutput(\n  mapIndex: Int,\n  bmAddress: BlockManagerId): Unit\n

                                                                                                                                                                                                                                                    removeMapOutput...FIXME

                                                                                                                                                                                                                                                    removeMapOutput\u00a0is used when:

                                                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested to unregisterMapOutput
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ShuffleStatus/#missing-partitions","title":"Missing Partitions
                                                                                                                                                                                                                                                    findMissingPartitions(): Seq[Int]\n

                                                                                                                                                                                                                                                    findMissingPartitions...FIXME

                                                                                                                                                                                                                                                    findMissingPartitions\u00a0is used when:

                                                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested to findMissingPartitions
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ShuffleStatus/#serializing-shuffle-map-output-statuses","title":"Serializing Shuffle Map Output Statuses
                                                                                                                                                                                                                                                    serializedMapStatus(\n  broadcastManager: BroadcastManager,\n  isLocal: Boolean,\n  minBroadcastSize: Int,\n  conf: SparkConf): Array[Byte]\n

                                                                                                                                                                                                                                                    serializedMapStatus...FIXME

                                                                                                                                                                                                                                                    serializedMapStatus\u00a0is used when:

                                                                                                                                                                                                                                                    • MessageLoop (of the MapOutputTrackerMaster) is requested to send map output locations for shuffle
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ShuffleStatus/#logging","title":"Logging

                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.ShuffleStatus logger to see what happens inside.

                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.ShuffleStatus=ALL\n

                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/Stage/","title":"Stage","text":"

                                                                                                                                                                                                                                                    Stage is an abstraction of steps in a physical execution plan.

                                                                                                                                                                                                                                                    Note

                                                                                                                                                                                                                                                    The logical DAG or logical execution plan is the RDD lineage.

Indirectly, a Stage is a set of parallel tasks (one task per partition of an RDD) that compute partial results of a function executed as part of a Spark job.

                                                                                                                                                                                                                                                    In other words, a Spark job is a computation \"sliced\" (not to use the reserved term partitioned) into stages.

                                                                                                                                                                                                                                                    "},{"location":"scheduler/Stage/#contract","title":"Contract","text":""},{"location":"scheduler/Stage/#missing-partitions","title":"Missing Partitions
                                                                                                                                                                                                                                                    findMissingPartitions(): Seq[Int]\n

                                                                                                                                                                                                                                                    Missing partitions (IDs of the partitions of the RDD that are missing and need to be computed)

                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                    • DAGScheduler is requested to submit missing tasks
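
A minimal Scala sketch of what the contract boils down to (StageSketch and markComputed are made-up names for illustration, not the actual Spark sources): a stage reports the IDs of the partitions that have produced no output yet.

class StageSketch(numPartitions: Int) {\n  private val computed = scala.collection.mutable.BitSet.empty\n  def markComputed(partitionId: Int): Unit = computed += partitionId\n  // IDs of the partitions that still have to be computed\n  def findMissingPartitions(): Seq[Int] = (0 until numPartitions).filterNot(computed)\n}\n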
                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/Stage/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                    • ResultStage
                                                                                                                                                                                                                                                    • ShuffleMapStage
                                                                                                                                                                                                                                                    "},{"location":"scheduler/Stage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                    Stage takes the following to be created:

                                                                                                                                                                                                                                                    • Stage ID
                                                                                                                                                                                                                                                    • RDD
                                                                                                                                                                                                                                                    • Number of tasks
                                                                                                                                                                                                                                                    • Parent Stages
                                                                                                                                                                                                                                                    • First Job ID
                                                                                                                                                                                                                                                    • CallSite
                                                                                                                                                                                                                                                    • Resource Profile ID

                                                                                                                                                                                                                                                      Abstract Class

                                                                                                                                                                                                                                                      Stage is an abstract class and cannot be created directly. It is created indirectly for the concrete Stages.

                                                                                                                                                                                                                                                      "},{"location":"scheduler/Stage/#rdd","title":"RDD

Stage is given an RDD when created.

                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Stage/#stage-id","title":"Stage ID

Stage is given a unique ID when created.

                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                      DAGScheduler uses nextStageId internal counter to track the number of stage submissions.

                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Stage/#making-new-stage-attempt","title":"Making New Stage Attempt
                                                                                                                                                                                                                                                      makeNewStageAttempt(\n  numPartitionsToCompute: Int,\n  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit\n

                                                                                                                                                                                                                                                      makeNewStageAttempt creates a new TaskMetrics and requests it to register itself with the SparkContext of the RDD.

                                                                                                                                                                                                                                                      makeNewStageAttempt creates a StageInfo from this Stage (and the nextAttemptId). This StageInfo is saved in the _latestInfo internal registry.

                                                                                                                                                                                                                                                      In the end, makeNewStageAttempt increments the nextAttemptId internal counter.

                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                      makeNewStageAttempt returns Unit (nothing) and its purpose is to update the latest StageInfo internal registry.

makeNewStageAttempt is used when:

                                                                                                                                                                                                                                                      • DAGScheduler is requested to submit the missing tasks of a stage
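
The attempt bookkeeping can be sketched as follows (StageAttemptSketch and AttemptInfo are made up for illustration; the real code builds a StageInfo with StageInfo.fromStage and registers TaskMetrics with the SparkContext):

case class AttemptInfo(stageId: Int, attemptId: Int, numTasks: Int)\n\nclass StageAttemptSketch(val id: Int) {\n  private var nextAttemptId = 0\n  private var _latestInfo: Option[AttemptInfo] = None\n  def makeNewStageAttempt(numPartitionsToCompute: Int): Unit = {\n    // record the metadata of this (latest) attempt...\n    _latestInfo = Some(AttemptInfo(id, nextAttemptId, numPartitionsToCompute))\n    // ...and make sure the next attempt gets a fresh ID\n    nextAttemptId += 1\n  }\n  def latestInfo: Option[AttemptInfo] = _latestInfo\n}\n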
                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/StageInfo/","title":"StageInfo","text":"

StageInfo is metadata about a stage that the scheduler passes to SparkListeners.

                                                                                                                                                                                                                                                      "},{"location":"scheduler/StageInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                      StageInfo takes the following to be created:

                                                                                                                                                                                                                                                      • Stage ID
                                                                                                                                                                                                                                                      • Stage Attempt ID
                                                                                                                                                                                                                                                      • Name
                                                                                                                                                                                                                                                      • Number of Tasks
                                                                                                                                                                                                                                                      • RDDInfos
                                                                                                                                                                                                                                                      • Parent IDs
                                                                                                                                                                                                                                                      • Details
                                                                                                                                                                                                                                                      • TaskMetrics (default: null)
                                                                                                                                                                                                                                                      • Task Locality Preferences (default: empty)
                                                                                                                                                                                                                                                      • Optional Shuffle Dependency ID (default: undefined)

StageInfo is created when:

                                                                                                                                                                                                                                                        • StageInfo utility is used to fromStage
                                                                                                                                                                                                                                                        • JsonProtocol (History Server) is used to stageInfoFromJson
                                                                                                                                                                                                                                                        "},{"location":"scheduler/StageInfo/#fromstage-utility","title":"fromStage Utility
                                                                                                                                                                                                                                                        fromStage(\n  stage: Stage,\n  attemptId: Int,\n  numTasks: Option[Int] = None,\n  taskMetrics: TaskMetrics = null,\n  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): StageInfo\n

fromStage creates RDDInfos for the RDD of the given Stage (and its narrow RDD ancestors) and uses them to create a StageInfo (with the stage ID, the given attempt ID, the stage name, the number of tasks, the parent stage IDs, the call site details, and the given TaskMetrics and task locality preferences).

                                                                                                                                                                                                                                                        fromStage\u00a0is used when:

• Stage is created and is requested to make a new stage attempt
                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/Task/","title":"Task","text":"

Task is an abstraction of the smallest individual unit of execution that computes an RDD partition.

                                                                                                                                                                                                                                                        "},{"location":"scheduler/Task/#contract","title":"Contract","text":""},{"location":"scheduler/Task/#running-task","title":"Running Task
                                                                                                                                                                                                                                                        runTask(\n  context: TaskContext): T\n

                                                                                                                                                                                                                                                        Runs the task (in a TaskContext)

                                                                                                                                                                                                                                                        Used when Task is requested to run

                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/Task/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                        • ResultTask
                                                                                                                                                                                                                                                        • ShuffleMapTask
                                                                                                                                                                                                                                                        "},{"location":"scheduler/Task/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                        Task takes the following to be created:

                                                                                                                                                                                                                                                        • Stage ID
                                                                                                                                                                                                                                                        • Stage (execution) Attempt ID
                                                                                                                                                                                                                                                        • Partition ID to compute
                                                                                                                                                                                                                                                        • Local Properties
                                                                                                                                                                                                                                                        • Serialized TaskMetrics (Array[Byte])
                                                                                                                                                                                                                                                        • ActiveJob ID (default: None)
                                                                                                                                                                                                                                                        • Application ID (default: None)
                                                                                                                                                                                                                                                        • Application Attempt ID (default: None)
                                                                                                                                                                                                                                                        • isBarrier flag
Task is created when:

                                                                                                                                                                                                                                                          • DAGScheduler is requested to submit missing tasks of a stage
                                                                                                                                                                                                                                                          Abstract Class

Task is an abstract class and cannot be created directly. It is created indirectly for the concrete Tasks.

                                                                                                                                                                                                                                                          "},{"location":"scheduler/Task/#isBarrier","title":"isBarrier Flag

                                                                                                                                                                                                                                                          Task can be given isBarrier flag when created. Unless given, isBarrier is assumed disabled (false).

                                                                                                                                                                                                                                                          isBarrier flag indicates whether this Task belongs to a Barrier Stage in Barrier Execution Mode.

                                                                                                                                                                                                                                                          isBarrier flag is used when:

                                                                                                                                                                                                                                                          • DAGScheduler is requested to handleTaskCompletion (of a FetchFailed task) to fail the parent stage (and retry a barrier stage when one of the barrier tasks fails)
                                                                                                                                                                                                                                                          • Task is requested to run (to create a BarrierTaskContext)
                                                                                                                                                                                                                                                          • TaskSetManager is requested to isBarrier and handleFailedTask
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#taskmemorymanager","title":"TaskMemoryManager

                                                                                                                                                                                                                                                          Task is given a TaskMemoryManager when TaskRunner is requested to run a task (right after deserializing the task for execution).

                                                                                                                                                                                                                                                          Task uses the TaskMemoryManager to create a TaskContextImpl (when requested to run).

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#serializable","title":"Serializable

Task is a Serializable (Java) so it can be serialized (to bytes) and sent over the wire for execution from the driver to executors.

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#preferred-locations","title":"Preferred Locations
                                                                                                                                                                                                                                                          preferredLocations: Seq[TaskLocation]\n

                                                                                                                                                                                                                                                          TaskLocations that represent preferred locations (executors) to execute the task on.

Empty by default, meaning no task location preferences are defined and the task can be launched on any executor.

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          Defined by the concrete tasks (i.e. ShuffleMapTask and ResultTask).

                                                                                                                                                                                                                                                          preferredLocations is used when TaskSetManager is requested to register a task as pending execution and dequeueSpeculativeTask.
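
An illustrative sketch only (TaskSketch and the String-based locations are made up; the real return type is Seq[TaskLocation]): a concrete task overrides preferredLocations to report where it would rather run, and an empty sequence means it can run anywhere.

abstract class TaskSketch {\n  // empty by default: no preference, any executor will do\n  def preferredLocations: Seq[String] = Nil\n}\n\nclass ResultTaskSketch(hosts: Seq[String]) extends TaskSketch {\n  // e.g. the hosts where the partition's cached blocks live\n  override def preferredLocations: Seq[String] = hosts\n}\n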

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#run","title":"Running Task
                                                                                                                                                                                                                                                          run(\n  taskAttemptId: Long,\n  attemptNumber: Int,\n  metricsSystem: MetricsSystem,\n  resources: Map[String, ResourceInformation],\n  plugins: Option[PluginContainer]): T\n

                                                                                                                                                                                                                                                          run registers the task (attempt) with the BlockManager.

run creates a TaskContextImpl (wrapped in a BarrierTaskContext when the isBarrier flag is enabled) that in turn becomes the task's TaskContext.

                                                                                                                                                                                                                                                          run checks _killed flag and, if enabled, kills the task (with interruptThread flag disabled).

                                                                                                                                                                                                                                                          run creates a Hadoop CallerContext and sets it.

                                                                                                                                                                                                                                                          run informs the given PluginContainer that the task is started.

                                                                                                                                                                                                                                                          run runs the task.

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          This is the moment when the custom Task's runTask is executed.

                                                                                                                                                                                                                                                          In the end, run notifies TaskContextImpl that the task has completed (regardless of the final outcome -- a success or a failure).

                                                                                                                                                                                                                                                          In case of any exceptions, run notifies TaskContextImpl that the task has failed. run requests MemoryStore to release unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes).

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          run uses SparkEnv to access the current BlockManager that it uses to access MemoryStore.

                                                                                                                                                                                                                                                          run requests MemoryManager to notify any tasks waiting for execution memory to be freed to wake up and try to acquire memory again.

                                                                                                                                                                                                                                                          run unsets the task's TaskContext.

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          run uses SparkEnv to access the current MemoryManager.

                                                                                                                                                                                                                                                          run is used when:

                                                                                                                                                                                                                                                          • TaskRunner is requested to run (when Executor is requested to launch a task (on \"Executor task launch worker\" thread pool sometime in the future))
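
A much-simplified skeleton of the control flow described above (not the actual Spark sources; the function parameters stand in for the TaskContext and memory-management calls):

def runSketch[T](\n    runTask: () => T,\n    markTaskFailed: Throwable => Unit,\n    markTaskCompleted: () => Unit,\n    releaseMemoryAndUnsetContext: () => Unit): T = {\n  try {\n    runTask() // the concrete Task's runTask\n  } catch {\n    case t: Throwable =>\n      markTaskFailed(t) // notify the TaskContext of the failure\n      throw t\n  } finally {\n    try markTaskCompleted() // always notify listeners of completion\n    finally releaseMemoryAndUnsetContext() // release unroll memory, wake up waiting tasks, unset TaskContext\n  }\n}\n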
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#task-states","title":"Task States

                                                                                                                                                                                                                                                          Task can be in one of the following states (as described by TaskState enumeration):

                                                                                                                                                                                                                                                          • LAUNCHING
                                                                                                                                                                                                                                                          • RUNNING when the task is being started.
                                                                                                                                                                                                                                                          • FINISHED when the task finished with the serialized result.
                                                                                                                                                                                                                                                          • FAILED when the task fails, e.g. when FetchFailedException, CommitDeniedException or any Throwable occurs
                                                                                                                                                                                                                                                          • KILLED when an executor kills a task.
                                                                                                                                                                                                                                                          • LOST

                                                                                                                                                                                                                                                          States are the values of org.apache.spark.TaskState.

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          Task status updates are sent from executors to the driver through ExecutorBackend.

A task is finished when it is in one of the FINISHED, FAILED, KILLED or LOST states.

                                                                                                                                                                                                                                                          LOST and FAILED states are considered failures.
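
The rules above can be mirrored with a small enumeration sketch (the name is simplified; the actual enumeration is org.apache.spark.TaskState):

object TaskStateSketch extends Enumeration {\n  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value\n  private val finishedStates = Set(FINISHED, FAILED, KILLED, LOST)\n  // a task is finished when in one of FINISHED, FAILED, KILLED, LOST\n  def isFinished(state: Value): Boolean = finishedStates(state)\n  // LOST and FAILED count as failures\n  def isFailed(state: Value): Boolean = state == LOST || state == FAILED\n}\n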

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#collecting-latest-values-of-accumulators","title":"Collecting Latest Values of Accumulators
                                                                                                                                                                                                                                                          collectAccumulatorUpdates(\n  taskFailed: Boolean = false): Seq[AccumulableInfo]\n

                                                                                                                                                                                                                                                          collectAccumulatorUpdates collects the latest values of internal and external accumulators from a task (and returns the values as a collection of AccumulableInfo).

                                                                                                                                                                                                                                                          Internally, collectAccumulatorUpdates takes TaskMetrics.

                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                          collectAccumulatorUpdates uses TaskContextImpl to access the task's TaskMetrics.

                                                                                                                                                                                                                                                          collectAccumulatorUpdates collects the latest values of:

• internal accumulators whose current value is not the zero value, and the RESULT_SIZE accumulator (regardless of whether its value is zero or not).

                                                                                                                                                                                                                                                          • external accumulators when taskFailed is disabled (false) or which should be included on failures.

                                                                                                                                                                                                                                                          collectAccumulatorUpdates returns an empty collection when TaskContextImpl is not initialized.

                                                                                                                                                                                                                                                          collectAccumulatorUpdates is used when TaskRunner runs a task (and sends a task's final results back to the driver).
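
The filtering rules can be sketched like this (AccSketch is a made-up stand-in for an AccumulatorV2 plus its metadata):

final case class AccSketch(\n    isInternal: Boolean,\n    isZero: Boolean,\n    isResultSize: Boolean,\n    countFailedValues: Boolean)\n\ndef collectAccumulatorUpdatesSketch(\n    accs: Seq[AccSketch],\n    taskFailed: Boolean = false): Seq[AccSketch] = {\n  // internal accumulators: keep non-zero values, and RESULT_SIZE unconditionally\n  val internal = accs.filter(a => a.isInternal && (!a.isZero || a.isResultSize))\n  // external accumulators: keep all on success, only countFailedValues ones on failure\n  val external = accs.filter(a => !a.isInternal && (!taskFailed || a.countFailedValues))\n  internal ++ external\n}\n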

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/Task/#killing-task","title":"Killing Task
                                                                                                                                                                                                                                                          kill(\n  interruptThread: Boolean): Unit\n

                                                                                                                                                                                                                                                          kill marks the task to be killed, i.e. it sets the internal _killed flag to true.

                                                                                                                                                                                                                                                          kill calls TaskContextImpl.markInterrupted when context is set.

                                                                                                                                                                                                                                                          If interruptThread is enabled and the internal taskThread is available, kill interrupts it.

                                                                                                                                                                                                                                                          CAUTION: FIXME When could context and interruptThread not be set?
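
A simplified sketch of the kill flow described above (the names are made up; the real code works with TaskContextImpl and the thread running the task):

trait ContextSketch { def markInterrupted(): Unit }\n\nclass KillableTaskSketch {\n  @volatile private var _killed = false\n  @volatile private var taskThread: Thread = null\n  @volatile private var context: Option[ContextSketch] = None\n  def kill(interruptThread: Boolean): Unit = {\n    _killed = true                       // mark the task as killed\n    context.foreach(_.markInterrupted()) // let the TaskContext know (when set)\n    if (interruptThread && taskThread != null) {\n      taskThread.interrupt()             // interrupt the thread running the task\n    }\n  }\n}\n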

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/","title":"TaskContext","text":"

                                                                                                                                                                                                                                                          TaskContext is an abstraction of task contexts.

                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskContext/#contract-subset","title":"Contract (Subset)","text":""},{"location":"scheduler/TaskContext/#addtaskcompletionlistener","title":"addTaskCompletionListener
                                                                                                                                                                                                                                                          addTaskCompletionListener[U](\n  f: (TaskContext) => U): TaskContext\naddTaskCompletionListener(\n  listener: TaskCompletionListener): TaskContext\n

                                                                                                                                                                                                                                                          Registers a TaskCompletionListener

                                                                                                                                                                                                                                                          val rdd = sc.range(0, 5, numSlices = 1)\n\nimport org.apache.spark.TaskContext\nval printTaskInfo = (tc: TaskContext) => {\n  val msg = s\"\"\"|-------------------\n                |partitionId:   ${tc.partitionId}\n                |stageId:       ${tc.stageId}\n                |attemptNum:    ${tc.attemptNumber}\n                |taskAttemptId: ${tc.taskAttemptId}\n                |-------------------\"\"\".stripMargin\n  println(msg)\n}\n\nrdd.foreachPartition { _ =>\n  val tc = TaskContext.get\n  tc.addTaskCompletionListener(printTaskInfo)\n}\n
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#addtaskfailurelistener","title":"addTaskFailureListener
                                                                                                                                                                                                                                                          addTaskFailureListener(\n  f: (TaskContext, Throwable) => Unit): TaskContext\naddTaskFailureListener(\n  listener: TaskFailureListener): TaskContext\n

                                                                                                                                                                                                                                                          Registers a TaskFailureListener

                                                                                                                                                                                                                                                          val rdd = sc.range(0, 2, numSlices = 2)\n\nimport org.apache.spark.TaskContext\nval printTaskErrorInfo = (tc: TaskContext, error: Throwable) => {\n  val msg = s\"\"\"|-------------------\n                |partitionId:   ${tc.partitionId}\n                |stageId:       ${tc.stageId}\n                |attemptNum:    ${tc.attemptNumber}\n                |taskAttemptId: ${tc.taskAttemptId}\n                |error:         ${error.toString}\n                |-------------------\"\"\".stripMargin\n  println(msg)\n}\n\nval throwExceptionForOddNumber = (n: Long) => {\n  if (n % 2 == 1) {\n    throw new Exception(s\"No way it will pass for odd number: $n\")\n  }\n}\n\n// FIXME It won't work.\nrdd.map(throwExceptionForOddNumber).foreachPartition { _ =>\n  val tc = TaskContext.get\n  tc.addTaskFailureListener(printTaskErrorInfo)\n}\n\n// Listener registration matters.\nrdd.mapPartitions { (it: Iterator[Long]) =>\n  val tc = TaskContext.get\n  tc.addTaskFailureListener(printTaskErrorInfo)\n  it\n}.map(throwExceptionForOddNumber).count\n
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#fetchfailed","title":"fetchFailed
                                                                                                                                                                                                                                                          fetchFailed: Option[FetchFailedException]\n

                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                          • TaskRunner is requested to run
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#getkillreason","title":"getKillReason
                                                                                                                                                                                                                                                          getKillReason(): Option[String]\n
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#getlocalproperty","title":"getLocalProperty
                                                                                                                                                                                                                                                          getLocalProperty(\n  key: String): String\n

                                                                                                                                                                                                                                                          Looks up a local property by key
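
A usage sketch (\"myapp.tag\" is a hypothetical key; local properties set on the driver with SparkContext.setLocalProperty are propagated to the tasks of jobs submitted from that thread):

sc.setLocalProperty(\"myapp.tag\", \"nightly-run\")\n\nsc.range(0, 2).foreachPartition { _ =>\n  val tc = org.apache.spark.TaskContext.get\n  println(tc.getLocalProperty(\"myapp.tag\"))\n}\n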

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#getmetricssources","title":"getMetricsSources
                                                                                                                                                                                                                                                          getMetricsSources(\n  sourceName: String): Seq[Source]\n

                                                                                                                                                                                                                                                          Looks up Sources by name

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#registering-accumulator","title":"Registering Accumulator
                                                                                                                                                                                                                                                          registerAccumulator(\n  a: AccumulatorV2[_, _]): Unit\n

Registers an AccumulatorV2

                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                          • AccumulatorV2 is requested to deserialize itself
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#resources","title":"Resources
                                                                                                                                                                                                                                                          resources(): Map[String, ResourceInformation]\n

                                                                                                                                                                                                                                                          Resources (names) allocated to this task

                                                                                                                                                                                                                                                          See:

                                                                                                                                                                                                                                                          • TaskContextImpl
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#taskmetrics","title":"taskMetrics
                                                                                                                                                                                                                                                          taskMetrics(): TaskMetrics\n

                                                                                                                                                                                                                                                          TaskMetrics

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#others","title":"others

                                                                                                                                                                                                                                                          Important

There are other methods, but they don't seem very interesting.

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                          • BarrierTaskContext
                                                                                                                                                                                                                                                          • TaskContextImpl
                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskContext/#serializable","title":"Serializable

                                                                                                                                                                                                                                                          TaskContext is a Serializable (Java).

                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContext/#accessing-taskcontext","title":"Accessing TaskContext
                                                                                                                                                                                                                                                          get(): TaskContext\n

                                                                                                                                                                                                                                                          get returns the thread-local TaskContext instance.

                                                                                                                                                                                                                                                          import org.apache.spark.TaskContext\nval tc = TaskContext.get\n
                                                                                                                                                                                                                                                          val rdd = sc.range(0, 3, numSlices = 3)\n\nassert(rdd.partitions.size == 3)\n\nrdd.foreach { n =>\n  import org.apache.spark.TaskContext\n  val tc = TaskContext.get\n  val msg = s\"\"\"|-------------------\n                |partitionId:   ${tc.partitionId}\n                |stageId:       ${tc.stageId}\n                |attemptNum:    ${tc.attemptNumber}\n                |taskAttemptId: ${tc.taskAttemptId}\n                |-------------------\"\"\".stripMargin\n  println(msg)\n}\n
                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskContextImpl/","title":"TaskContextImpl","text":"

                                                                                                                                                                                                                                                          TaskContextImpl is a concrete TaskContext.

                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskContextImpl/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                          TaskContextImpl takes the following to be created:

                                                                                                                                                                                                                                                          • Stage ID
                                                                                                                                                                                                                                                          • Stage Execution Attempt ID
                                                                                                                                                                                                                                                          • Partition ID
                                                                                                                                                                                                                                                          • Task Execution Attempt ID
                                                                                                                                                                                                                                                          • Attempt Number
                                                                                                                                                                                                                                                          • TaskMemoryManager
                                                                                                                                                                                                                                                          • Local Properties
                                                                                                                                                                                                                                                          • MetricsSystem
                                                                                                                                                                                                                                                          • TaskMetrics
                                                                                                                                                                                                                                                          • Resources
TaskContextImpl is created when:

                                                                                                                                                                                                                                                            • Task is requested to run
                                                                                                                                                                                                                                                            "},{"location":"scheduler/TaskContextImpl/#resources","title":"Resources","text":"TaskContext
                                                                                                                                                                                                                                                            resources: Map[String, ResourceInformation]\n

                                                                                                                                                                                                                                                            resources is part of the TaskContext abstraction.

                                                                                                                                                                                                                                                            TaskContextImpl can be given resources (names) when created.

The resources are given when a Task is requested to run and, in turn, come from a TaskDescription (of a TaskRunner).
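For illustration, a task can inspect the resources it was given via TaskContext. This sketch assumes the application requested a custom resource (e.g. spark.executor.resource.gpu.amount and spark.task.resource.gpu.amount); with no resource requests the map is simply empty.

sc.range(0, 1).foreach { _ =>\n  val tc = org.apache.spark.TaskContext.get\n  tc.resources().foreach { case (name, info) =>\n    // e.g. name = \"gpu\", info.addresses = Array(\"0\")\n    println(s\"$name -> ${info.addresses.mkString(\",\")}\")\n  }\n}\n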

                                                                                                                                                                                                                                                            "},{"location":"scheduler/TaskContextImpl/#barriertaskcontext","title":"BarrierTaskContext

                                                                                                                                                                                                                                                            TaskContextImpl is available to barrier tasks as a BarrierTaskContext.
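For illustration, barrier tasks use the richer BarrierTaskContext rather than the plain TaskContext. A minimal sketch (assuming the application has at least 4 task slots so the barrier stage can run all its tasks concurrently):

import org.apache.spark.BarrierTaskContext\n\nval rdd = sc.range(0, 4, numSlices = 4)\nrdd.barrier().mapPartitions { it =>\n  val ctx = BarrierTaskContext.get()\n  ctx.barrier()  // wait until all tasks of the stage reach this point\n  it\n}.count()\n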

                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/TaskDescription/","title":"TaskDescription","text":"

TaskDescription is the metadata of a Task.

                                                                                                                                                                                                                                                            "},{"location":"scheduler/TaskDescription/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                            TaskDescription takes the following to be created:

                                                                                                                                                                                                                                                            • Task ID
                                                                                                                                                                                                                                                            • Task attempt number
                                                                                                                                                                                                                                                            • Executor ID
                                                                                                                                                                                                                                                            • Task name
                                                                                                                                                                                                                                                            • Task index (within the TaskSet)
                                                                                                                                                                                                                                                            • Partition ID
                                                                                                                                                                                                                                                            • Added files (as Map[String, Long])
                                                                                                                                                                                                                                                            • Added JAR files (as Map[String, Long])
                                                                                                                                                                                                                                                            • Properties
                                                                                                                                                                                                                                                            • Resources
                                                                                                                                                                                                                                                            • Serialized task (as ByteBuffer)

                                                                                                                                                                                                                                                              TaskDescription is created when:

                                                                                                                                                                                                                                                              • TaskSetManager is requested to find a task ready for execution (given a resource offer)
                                                                                                                                                                                                                                                              "},{"location":"scheduler/TaskDescription/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                              resources: Map[String, ResourceInformation]\n

                                                                                                                                                                                                                                                              TaskDescription is given resources when created.

                                                                                                                                                                                                                                                              The resources are either specified when TaskSetManager is requested to resourceOffer (and prepareLaunchingTask) or decoded from bytes.

                                                                                                                                                                                                                                                              "},{"location":"scheduler/TaskDescription/#text-representation","title":"Text Representation
                                                                                                                                                                                                                                                              toString: String\n

                                                                                                                                                                                                                                                              toString uses the taskId and index as follows:

                                                                                                                                                                                                                                                              TaskDescription(TID=[taskId], index=[index])\n
                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/TaskDescription/#decoding-taskdescription-from-serialized-format","title":"Decoding TaskDescription (from Serialized Format)
                                                                                                                                                                                                                                                              decode(\n  byteBuffer: ByteBuffer): TaskDescription\n

decode decodes a TaskDescription from the serialized format (ByteBuffer).

                                                                                                                                                                                                                                                              Internally, decode...FIXME

                                                                                                                                                                                                                                                              decode is used when:

• CoarseGrainedExecutorBackend is requested to handle a LaunchTask message

• Spark on Mesos' MesosExecutorBackend is requested to launch a task

                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/TaskDescription/#encoding-taskdescription-to-serialized-format","title":"Encoding TaskDescription (to Serialized Format)
                                                                                                                                                                                                                                                              encode(\n  taskDescription: TaskDescription): ByteBuffer\n

encode encodes the given TaskDescription to a serialized format (ByteBuffer).

                                                                                                                                                                                                                                                              Internally, encode...FIXME

                                                                                                                                                                                                                                                              encode is used when:

                                                                                                                                                                                                                                                              • DriverEndpoint (of CoarseGrainedSchedulerBackend) is requested to launchTasks
                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/TaskDescription/#task-name","title":"Task Name

                                                                                                                                                                                                                                                              The name of the task is of the format:

                                                                                                                                                                                                                                                              task [taskID] in stage [taskSetID]\n
                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/TaskInfo/","title":"TaskInfo","text":"


TaskInfo is information about a running task attempt inside a TaskSet.

                                                                                                                                                                                                                                                              TaskInfo is created when:

• TaskSetManager dequeues a task for execution (given a resource offer) and records the task as running

                                                                                                                                                                                                                                                              • TaskUIData does dropInternalAndSQLAccumulables

• JsonProtocol utility is used to re-create task details from JSON

NOTE: When TaskInfo was first merged into Apache Spark on 07/06/12 (commit 63051dd2bcc4bf09d413ff7cf89a37967edc33ba), it was part of the spark.scheduler.mesos package -- the \"Mesos\" in the package name shows how much Spark and Mesos influenced each other at that time.

TaskInfo's Internal Registries and Counters:

• finishTime: Time when TaskInfo was marked as finished (markFinished). Used when...FIXME

Creating TaskInfo Instance

                                                                                                                                                                                                                                                              TaskInfo takes the following when created:

• Task ID
• Index of the task within its TaskSet (which may not necessarily be the same as the ID of the RDD partition the task is computing)
• Task attempt ID
• Time when the task was dequeued for execution
• Executor that has been offered (as a resource) to run the task
• Host of the executor
• TaskLocality, i.e. the locality preference of the task
• Flag whether the task is speculative or not

TaskInfo initializes the internal registries and counters.

Marking Task As Finished (Successfully or Not) -- markFinished Method

                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskInfo/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/TaskInfo/#markfinishedstate-taskstate-time-long-systemcurrenttimemillis-unit","title":"markFinished(state: TaskState, time: Long = System.currentTimeMillis): Unit","text":"

markFinished records the input time as finishTime.

markFinished marks TaskInfo as failed when the input state is FAILED or as killed when the state is KILLED.

NOTE: markFinished is used when TaskSetManager is notified that a task has finished successfully (handleSuccessfulTask) or failed (handleFailedTask).

                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskLocation/","title":"TaskLocation","text":"

TaskLocation represents a placement preference of an RDD partition, i.e. a hint of the location to submit tasks for execution.

TaskLocations are tracked by the DAGScheduler (cacheLocs) for submitting missing tasks of a stage.

TaskLocation is available as the preferredLocations of a task.

Every TaskLocation describes the location by host name, but could also use other location-related metadata.

The TaskLocations of a given RDD partition are available using the SparkContext.getPreferredLocs method.
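For illustration, a sketch that inspects the preferred locations of a cached RDD's partitions (the returned TaskLocations are only non-empty when locality information exists, e.g. after caching or for HDFS-backed RDDs):

val rdd = sc.range(0, 10, numSlices = 2).cache()\nrdd.count()  // materialize the cache so block locations exist\n\n(0 until rdd.partitions.length).foreach { p =>\n  val locs = sc.getPreferredLocs(rdd, p)  // Seq[TaskLocation]\n  println(s\"partition $p -> $locs\")\n}\n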

                                                                                                                                                                                                                                                                Sealed

TaskLocation is a Scala private[spark] sealed trait, so all the available implementations of the TaskLocation trait are in a single Scala file.

ExecutorCacheTaskLocation

ExecutorCacheTaskLocation describes a host and an executor.

ExecutorCacheTaskLocation tells the scheduler to prefer a given executor; if that is not possible, the next level of preference is any executor on the same host.

HDFSCacheTaskLocation

HDFSCacheTaskLocation describes a host that is cached by HDFS.

Used exclusively when HadoopRDD and NewHadoopRDD are requested for their placement preferences (aka preferred locations).

HostTaskLocation

HostTaskLocation describes a host only."},{"location":"scheduler/TaskResult/","title":"TaskResult","text":"

                                                                                                                                                                                                                                                                TaskResult is an abstraction of task results (of type T).

The decision of which TaskResult type to use is made when TaskRunner finishes running a task.

                                                                                                                                                                                                                                                                Sealed Trait

                                                                                                                                                                                                                                                                TaskResult is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResult/#directtaskresult","title":"DirectTaskResult

                                                                                                                                                                                                                                                                DirectTaskResult is a TaskResult to be serialized and sent over the wire to the driver together with the following:

                                                                                                                                                                                                                                                                • Value Bytes (java.nio.ByteBuffer)
                                                                                                                                                                                                                                                                • Accumulator updates
                                                                                                                                                                                                                                                                • Metric Peaks

DirectTaskResult is used when the size of a task result is below spark.driver.maxResultSize and below the maximum size of direct results.

                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskResult/#indirecttaskresult","title":"IndirectTaskResult

                                                                                                                                                                                                                                                                  IndirectTaskResult is a \"pointer\" to a task result that is available in a BlockManager:

                                                                                                                                                                                                                                                                  • BlockId
                                                                                                                                                                                                                                                                  • Size

                                                                                                                                                                                                                                                                    IndirectTaskResult is a java.io.Serializable.
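Putting the two result types together, here is a simplified, illustrative sketch of the choice described above (not the actual TaskRunner code; the thresholds stand for spark.driver.maxResultSize and the maximum direct-result size):

// Illustrative sketch only (not the actual TaskRunner code)\ndef chooseResult(\n    resultSize: Long,\n    maxResultSize: Long,\n    maxDirectResultSize: Long): String =\n  if (maxResultSize > 0 && resultSize > maxResultSize)\n    \"IndirectTaskResult (result dropped as too large for the driver)\"\n  else if (resultSize > maxDirectResultSize)\n    \"IndirectTaskResult (result stored in the BlockManager and fetched by the driver)\"\n  else\n    \"DirectTaskResult (result sent directly with the status update)\"\n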

                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskResult/#externalizable","title":"Externalizable

                                                                                                                                                                                                                                                                    DirectTaskResult is an Externalizable (Java).

                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskResultGetter/","title":"TaskResultGetter","text":"

TaskResultGetter is a helper class of TaskSchedulerImpl (statusUpdate) for asynchronous deserialization of task results (possibly fetching remote blocks) or task failure reasons (TaskEndReason).

                                                                                                                                                                                                                                                                    CAUTION: FIXME Image with the dependencies

TIP: Consult Task States to learn about the different task states.

NOTE: The only instance of TaskResultGetter is created while TaskSchedulerImpl is created.

TaskResultGetter requires a SparkEnv and a TaskSchedulerImpl to be created, and is stopped when TaskSchedulerImpl stops.

TaskResultGetter uses the task-result-getter asynchronous task executor for operation.

Tip

                                                                                                                                                                                                                                                                    Enable DEBUG logging level for org.apache.spark.scheduler.TaskResultGetter logger to see what happens inside.

                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.scheduler.TaskResultGetter=DEBUG\n
                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":"

task-result-getter Asynchronous Task Executor

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#gettaskresultexecutor-executorservice","title":"getTaskResultExecutor: ExecutorService","text":"

getTaskResultExecutor creates a daemon thread pool with a configurable number of threads (spark.resultGetter.threads) and the task-result-getter thread-name prefix.

TIP: Read up on java.util.concurrent.ThreadPoolExecutor (https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html) that getTaskResultExecutor uses under the covers.
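A minimal sketch of such a pool using plain JDK executors (the thread count of 4 is an assumption, and Spark's internal helper differs in detail):

import java.util.concurrent.{Executors, ThreadFactory}\nimport java.util.concurrent.atomic.AtomicInteger\n\n// Sketch: a daemon thread pool with a \"task-result-getter\" thread-name prefix\nval counter = new AtomicInteger(0)\nval factory = new ThreadFactory {\n  override def newThread(r: Runnable): Thread = {\n    val t = new Thread(r, s\"task-result-getter-${counter.getAndIncrement()}\")\n    t.setDaemon(true)\n    t\n  }\n}\nval pool = Executors.newFixedThreadPool(4, factory)  // 4 is an assumed thread count\n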

stop Method

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#stop-unit","title":"stop(): Unit","text":"

stop stops the internal task-result-getter asynchronous task executor.

serializer Attribute

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#source-scala_2","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#serializer-threadlocalserializerinstance","title":"serializer: ThreadLocal[SerializerInstance]","text":"

serializer is a thread-local SerializerInstance that TaskResultGetter uses to deserialize byte buffers (with TaskResults or a TaskEndReason).

When created for a new thread, serializer is initialized with a new instance of Serializer (using SparkEnv.closureSerializer).

NOTE: TaskResultGetter uses java.lang.ThreadLocal (https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html) for the thread-local SerializerInstance variable.
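A sketch of the thread-local initialization just described (internal code; sparkEnv of type org.apache.spark.SparkEnv is assumed to be in scope):

// Sketch: one SerializerInstance per thread, created lazily from the closure serializer\nval serializer = new ThreadLocal[org.apache.spark.serializer.SerializerInstance] {\n  // sparkEnv (org.apache.spark.SparkEnv) is assumed to be in scope\n  override def initialValue(): org.apache.spark.serializer.SerializerInstance =\n    sparkEnv.closureSerializer.newInstance()\n}\n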

taskResultSerializer Attribute

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#source-scala_3","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#taskresultserializer-threadlocalserializerinstance","title":"taskResultSerializer: ThreadLocal[SerializerInstance]","text":"

taskResultSerializer is a thread-local SerializerInstance that TaskResultGetter uses to deserialize the value of a DirectTaskResult.

When created for a new thread, taskResultSerializer is initialized with a new instance of Serializer (using SparkEnv.serializer).

NOTE: TaskResultGetter uses java.lang.ThreadLocal (https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html) for the thread-local SerializerInstance variable.

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskResultGetter/#enqueuing-successful-task","title":"Enqueuing Successful Task
                                                                                                                                                                                                                                                                    enqueueSuccessfulTask(\n  taskSetManager: TaskSetManager,\n  tid: Long,\n  serializedData: ByteBuffer): Unit\n

enqueueSuccessfulTask submits an asynchronous task (to the task-result-getter asynchronous task executor) that first deserializes serializedData to a DirectTaskResult, then updates the internal accumulator (with the size of the DirectTaskResult) and ultimately notifies the TaskSchedulerImpl that the tid task was completed and scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[the task result was received successfully] or scheduler:TaskSchedulerImpl.md#handleFailedTask[not].

NOTE: enqueueSuccessfulTask merely enqueues the asynchronous task; the task-result-getter asynchronous task executor runs it at some point in the future.

Internally, the enqueued task first deserializes serializedData to a TaskResult (using the internal thread-local serializer).

                                                                                                                                                                                                                                                                    For a DirectTaskResult, the task scheduler:TaskSetManager.md#canFetchMoreResults[checks the available memory for the task result] and, when the size overflows configuration-properties.md#spark.driver.maxResultSize[spark.driver.maxResultSize], it simply returns.

                                                                                                                                                                                                                                                                    Note

enqueueSuccessfulTask runs as an asynchronous task, so returning from it simply ends the processing of this task result. That is why the quota check (canFetchMoreResults) itself aborts the TaskSet when there is not enough memory.

Otherwise, when there is enough memory to hold the task result, it deserializes the value of the DirectTaskResult (using the internal thread-local taskResultSerializer).

                                                                                                                                                                                                                                                                    For an IndirectTaskResult, the task checks the available memory for the task result and, when the size could overflow the maximum result size, it storage:BlockManagerMaster.md#removeBlock[removes the block] and simply returns.

                                                                                                                                                                                                                                                                    Otherwise, when there is enough memory to hold the task result, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                    Fetching indirect task result for TID [tid]\n

                                                                                                                                                                                                                                                                    The task scheduler:TaskSchedulerImpl.md#handleTaskGettingResult[notifies TaskSchedulerImpl that it is about to fetch a remote block for a task result]. It then storage:BlockManager.md#getRemoteBytes[gets the block from remote block managers (as serialized bytes)].

                                                                                                                                                                                                                                                                    When the block could not be fetched, scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed] (with TaskResultLost task failure reason) and the task simply returns.

NOTE: Since enqueueSuccessfulTask runs as an asynchronous task, returning from it ends the processing; the real handling happens when scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed].

The fetched task result (as a serialized byte buffer) is then deserialized to a DirectTaskResult (using the internal thread-local serializer) and its value deserialized using the internal thread-local taskResultSerializer (just like in the DirectTaskResult case). The storage:BlockManagerMaster.md#removeBlock[block is then removed from BlockManagerMaster].

                                                                                                                                                                                                                                                                    Note

An IndirectTaskResult goes through two deserializations before the final task result is available: serializedData is deserialized to an IndirectTaskResult (a pointer to a block) and the fetched block is then deserialized to a DirectTaskResult (both using the thread-local serializer). Compare that to a DirectTaskResult that is deserialized only once.

                                                                                                                                                                                                                                                                    With no exceptions thrown, enqueueSuccessfulTask scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[informs the TaskSchedulerImpl that the tid task was completed and the task result was received].

                                                                                                                                                                                                                                                                    A ClassNotFoundException leads to scheduler:TaskSetManager.md#abort[aborting the TaskSet] (with ClassNotFound with classloader: [loader] error message) while any non-fatal exception shows the following ERROR message in the logs followed by scheduler:TaskSetManager.md#abort[aborting the TaskSet].

                                                                                                                                                                                                                                                                    Exception while getting task result\n

                                                                                                                                                                                                                                                                    enqueueSuccessfulTask is used when TaskSchedulerImpl is requested to handle task status update (and the task has finished successfully).
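The control flow described above boils down to the following simplified sketch (accumulator updates and exception handling omitted; scheduler, sparkEnv and getTaskResultExecutor stand for the TaskSchedulerImpl, the SparkEnv and the task-result-getter asynchronous task executor the TaskResultGetter works with):

[source, scala]
----
// Simplified sketch of the asynchronous task submitted by enqueueSuccessfulTask
// (accumulator updates and exception handling left out).
getTaskResultExecutor.execute(new Runnable {
  override def run(): Unit = {
    serializer.get().deserialize[TaskResult[_]](serializedData) match {
      case directResult: DirectTaskResult[_] =>
        if (!taskSetManager.canFetchMoreResults(serializedData.limit())) return
        // deserialize the value bytes with the thread-local taskResultSerializer
        directResult.value(taskResultSerializer.get())
        scheduler.handleSuccessfulTask(taskSetManager, tid, directResult)
      case IndirectTaskResult(blockId, size) =>
        if (!taskSetManager.canFetchMoreResults(size)) {
          sparkEnv.blockManager.master.removeBlock(blockId)
          return
        }
        scheduler.handleTaskGettingResult(taskSetManager, tid)
        sparkEnv.blockManager.getRemoteBytes(blockId) match {
          case Some(bytes) =>
            val result = serializer.get()
              .deserialize[DirectTaskResult[_]](bytes.toByteBuffer)
            result.value(taskResultSerializer.get())
            sparkEnv.blockManager.master.removeBlock(blockId)
            scheduler.handleSuccessfulTask(taskSetManager, tid, result)
          case None =>
            // the remote block could not be fetched
            scheduler.handleFailedTask(
              taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
        }
    }
  }
})
----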

                                                                                                                                                                                                                                                                    === [[enqueueFailedTask]] Deserializing TaskFailedReason and Notifying TaskSchedulerImpl -- enqueueFailedTask Method

                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskResultGetter/#source-scala_4","title":"[source, scala]

                                                                                                                                                                                                                                                                    enqueueFailedTask( taskSetManager: TaskSetManager, tid: Long, taskState: TaskState.TaskState, serializedData: ByteBuffer): Unit

enqueueFailedTask submits an asynchronous task (to the task-result-getter asynchronous task executor) that first attempts to deserialize a TaskFailedReason from serializedData (using the internal thread-local serializer) and then scheduler:TaskSchedulerImpl.md#handleFailedTask[notifies TaskSchedulerImpl that the task has failed].

                                                                                                                                                                                                                                                                    Any ClassNotFoundException leads to the following ERROR message in the logs (without breaking the flow of enqueueFailedTask):

                                                                                                                                                                                                                                                                    ERROR Could not deserialize TaskEndReason: ClassNotFound with classloader [loader]\n

                                                                                                                                                                                                                                                                    NOTE: enqueueFailedTask is called when scheduler:TaskSchedulerImpl.md#statusUpdate[TaskSchedulerImpl is notified about a task that has failed (and is in FAILED, KILLED or LOST state)].
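The asynchronous task can be sketched as follows (a simplified sketch; loader stands for the current context or Spark classloader):

[source, scala]
----
// Simplified sketch of the asynchronous task submitted by enqueueFailedTask.
getTaskResultExecutor.execute(new Runnable {
  override def run(): Unit = {
    var reason: TaskFailedReason = UnknownReason
    try {
      if (serializedData != null && serializedData.limit() > 0) {
        reason = serializer.get()
          .deserialize[TaskFailedReason](serializedData, loader)
      }
    } catch {
      case _: ClassNotFoundException =>
        logError(s"Could not deserialize TaskEndReason: ClassNotFound with classloader $loader")
    }
    // TaskSchedulerImpl is notified regardless of whether deserialization succeeded
    scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
  }
})
----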

                                                                                                                                                                                                                                                                    === [[settings]] Settings

.Spark Properties
[cols=\"1,1,2\",options=\"header\",width=\"100%\"]
|===
| Spark Property | Default Value | Description

| [[spark_resultGetter_threads]] spark.resultGetter.threads
| 4
| The number of threads for TaskResultGetter.
|===
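For example, to give TaskResultGetter more threads for fetching and deserializing task results (a usage sketch; the property is internal and rarely needs tuning):

[source, scala]
----
import org.apache.spark.SparkConf

// Use 8 result-getter threads instead of the default 4.
val conf = new SparkConf()
  .setAppName("result-getter-tuning")
  .set("spark.resultGetter.threads", "8")
----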

                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskScheduler/","title":"TaskScheduler","text":"

TaskScheduler is an abstraction of task schedulers that can submit tasks for execution in a Spark application (per scheduling policy).

NOTE: TaskScheduler works closely with scheduler:DAGScheduler.md[DAGScheduler] that submits sets of tasks (for every stage in a Spark job).

TaskScheduler can track the executors available in a Spark application using executorHeartbeatReceived and executorLost interceptors (that inform about active and lost executors, respectively).

                                                                                                                                                                                                                                                                    == [[submitTasks]] Submitting Tasks for Execution

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    submitTasks( taskSet: TaskSet): Unit

                                                                                                                                                                                                                                                                    Submits the tasks (of the given scheduler:TaskSet.md[TaskSet]) for execution.

                                                                                                                                                                                                                                                                    Used when DAGScheduler is requested to scheduler:DAGScheduler.md#submitMissingTasks[submit missing tasks (of a stage)].

                                                                                                                                                                                                                                                                    == [[executorHeartbeatReceived]] Handling Executor Heartbeat

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    executorHeartbeatReceived( execId: String, accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])], blockManagerId: BlockManagerId): Boolean

                                                                                                                                                                                                                                                                    Handles a heartbeat from an executor

                                                                                                                                                                                                                                                                    Returns true when the execId executor is managed by the TaskScheduler. false indicates that the executor:Executor.md#reportHeartBeat[block manager (on the executor) should re-register].

                                                                                                                                                                                                                                                                    Used when HeartbeatReceiver RPC endpoint is requested to handle a Heartbeat (with task metrics) from an executor

                                                                                                                                                                                                                                                                    == [[killTaskAttempt]] Killing Task

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    killTaskAttempt( taskId: Long, interruptThread: Boolean, reason: String): Boolean

                                                                                                                                                                                                                                                                    Kills a task (attempt)

                                                                                                                                                                                                                                                                    Used when DAGScheduler is requested to scheduler:DAGScheduler.md#killTaskAttempt[kill a task]

                                                                                                                                                                                                                                                                    == [[workerRemoved]] workerRemoved Notification

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    workerRemoved( workerId: String, host: String, message: String): Unit

                                                                                                                                                                                                                                                                    Used when DriverEndpoint is requested to handle a RemoveWorker event

                                                                                                                                                                                                                                                                    == [[contract]] Contract

                                                                                                                                                                                                                                                                    [cols=\"30m,70\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                                                                                                                    | applicationAttemptId a| [[applicationAttemptId]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_4","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#applicationattemptid-optionstring","title":"applicationAttemptId(): Option[String]","text":"

                                                                                                                                                                                                                                                                    Unique identifier of an (execution) attempt of the Spark application

                                                                                                                                                                                                                                                                    Used when SparkContext is created

                                                                                                                                                                                                                                                                    | cancelTasks a| [[cancelTasks]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_5","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    cancelTasks( stageId: Int, interruptThread: Boolean): Unit

                                                                                                                                                                                                                                                                    Cancels all the tasks of a given Stage.md[stage]

                                                                                                                                                                                                                                                                    Used when DAGScheduler is requested to DAGScheduler.md#failJobAndIndependentStages[failJobAndIndependentStages]

                                                                                                                                                                                                                                                                    | defaultParallelism a| [[defaultParallelism]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_6","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#defaultparallelism-int","title":"defaultParallelism(): Int","text":"

                                                                                                                                                                                                                                                                    Default level of parallelism

                                                                                                                                                                                                                                                                    Used when SparkContext is requested for the default level of parallelism

                                                                                                                                                                                                                                                                    | executorLost a| [[executorLost]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_7","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    executorLost( executorId: String, reason: ExecutorLossReason): Unit

                                                                                                                                                                                                                                                                    Handles an executor lost event

                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                    • HeartbeatReceiver RPC endpoint is requested to expireDeadHosts

• DriverEndpoint RPC endpoint is requested to remove (forget) and disable a malfunctioning executor (i.e. either lost or blacklisted for some reason)

                                                                                                                                                                                                                                                                    | killAllTaskAttempts a| [[killAllTaskAttempts]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_8","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                    killAllTaskAttempts( stageId: Int, interruptThread: Boolean, reason: String): Unit

                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                    • DAGScheduler is requested to DAGScheduler.md#handleTaskCompletion[handleTaskCompletion]

                                                                                                                                                                                                                                                                    • TaskSchedulerImpl is requested to TaskSchedulerImpl.md#cancelTasks[cancel all the tasks of a stage]

                                                                                                                                                                                                                                                                    | rootPool a| [[rootPool]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_9","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#rootpool-pool","title":"rootPool: Pool","text":"

                                                                                                                                                                                                                                                                    Top-level (root) scheduler:spark-scheduler-Pool.md[schedulable pool]

                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                    • TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize]

                                                                                                                                                                                                                                                                    • SparkContext is requested to SparkContext.md#getAllPools[getAllPools] and SparkContext.md#getPoolForName[getPoolForName]

                                                                                                                                                                                                                                                                    • TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#resourceOffers[resourceOffers], scheduler:TaskSchedulerImpl.md#checkSpeculatableTasks[checkSpeculatableTasks], and scheduler:TaskSchedulerImpl.md#removeExecutor[removeExecutor]

                                                                                                                                                                                                                                                                    | schedulingMode a| [[schedulingMode]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_10","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#schedulingmode-schedulingmode","title":"schedulingMode: SchedulingMode","text":"

                                                                                                                                                                                                                                                                    scheduler:spark-scheduler-SchedulingMode.md[Scheduling mode]

                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                    • TaskSchedulerImpl is scheduler:TaskSchedulerImpl.md#rootPool[created] and scheduler:TaskSchedulerImpl.md#initialize[initialized]

                                                                                                                                                                                                                                                                    • SparkContext is requested to SparkContext.md#getSchedulingMode[getSchedulingMode]

                                                                                                                                                                                                                                                                    | setDAGScheduler a| [[setDAGScheduler]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_11","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#setdagschedulerdagscheduler-dagscheduler-unit","title":"setDAGScheduler(dagScheduler: DAGScheduler): Unit","text":"

                                                                                                                                                                                                                                                                    Associates a scheduler:DAGScheduler.md[DAGScheduler]

                                                                                                                                                                                                                                                                    Used when DAGScheduler is scheduler:DAGScheduler.md#creating-instance[created]

                                                                                                                                                                                                                                                                    | start a| [[start]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_12","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#start-unit","title":"start(): Unit","text":"

                                                                                                                                                                                                                                                                    Starts the TaskScheduler

                                                                                                                                                                                                                                                                    Used when SparkContext is created

                                                                                                                                                                                                                                                                    | stop a| [[stop]]

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_13","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#stop-unit","title":"stop(): Unit","text":"

                                                                                                                                                                                                                                                                    Stops the TaskScheduler

                                                                                                                                                                                                                                                                    Used when DAGScheduler is requested to scheduler:DAGScheduler.md#stop[stop]

                                                                                                                                                                                                                                                                    |===
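Taken together, the contract can be sketched as the following trait (a condensed sketch based on the signatures on this page; the actual trait is private[spark] and has a few more members):

[source, scala]
----
package org.apache.spark.scheduler

import org.apache.spark.storage.BlockManagerId
import org.apache.spark.util.AccumulatorV2

// Condensed sketch of the TaskScheduler contract as described on this page.
trait TaskScheduler {
  def applicationId(): String
  def applicationAttemptId(): Option[String]
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit
  def defaultParallelism(): Int
  def executorHeartbeatReceived(
      execId: String,
      accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
      blockManagerId: BlockManagerId): Boolean
  def executorLost(executorId: String, reason: ExecutorLossReason): Unit
  def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit
  def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Boolean
  def postStartHook(): Unit = {}  // does nothing by default
  def rootPool: Pool
  def schedulingMode: SchedulingMode.SchedulingMode
  def setDAGScheduler(dagScheduler: DAGScheduler): Unit
  def start(): Unit
  def stop(): Unit
  def submitTasks(taskSet: TaskSet): Unit
  def workerRemoved(workerId: String, host: String, message: String): Unit
}
----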

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#lifecycle","title":"Lifecycle","text":"

                                                                                                                                                                                                                                                                    A TaskScheduler is created while SparkContext is being created (by calling SparkContext.createTaskScheduler for a given master URL and deploy mode).

                                                                                                                                                                                                                                                                    At this point in SparkContext's lifecycle, the internal _taskScheduler points at the TaskScheduler (and it is \"announced\" by sending a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC endpoint).

The TaskScheduler is started right after the blocking TaskSchedulerIsSet message receives a response.

The application id and the application attempt id are set at this point (SparkContext uses the application id to set the SparkConf.md#spark.app.id[spark.app.id] Spark property, and to configure webui:spark-webui-SparkUI.md[SparkUI] and storage:BlockManager.md[BlockManager]).

                                                                                                                                                                                                                                                                    CAUTION: FIXME The application id is described as \"associated with the job.\" in TaskScheduler, but I think it is \"associated with the application\" and you can have many jobs per application.

Right before SparkContext is fully initialized, postStartHook is called.

                                                                                                                                                                                                                                                                    The internal _taskScheduler is cleared (i.e. set to null) while SparkContext.md#stop[SparkContext is being stopped].

The TaskScheduler is stopped while scheduler:DAGScheduler.md#stop[DAGScheduler is being stopped].

                                                                                                                                                                                                                                                                    WARNING: FIXME If it is SparkContext to start a TaskScheduler, shouldn't SparkContext stop it too? Why is this the way it is now?

                                                                                                                                                                                                                                                                    == [[postStartHook]] Post-Start Initialization

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_14","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#poststarthook-unit","title":"postStartHook(): Unit","text":"

postStartHook does nothing by default, but allows custom implementations to do some additional post-start initialization.

                                                                                                                                                                                                                                                                    postStartHook is used when:

                                                                                                                                                                                                                                                                    • SparkContext is created

                                                                                                                                                                                                                                                                    • Spark on YARN's YarnClusterScheduler is requested to spark-on-yarn:spark-yarn-yarnclusterscheduler.md#postStartHook[postStartHook]
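A custom scheduler could plug into the hook like this (a hypothetical sketch; MyClusterScheduler is not part of Spark, and since TaskSchedulerImpl is private[spark] a real subclass has to live under the org.apache.spark package):

[source, scala]
----
package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// Hypothetical subclass (not part of Spark) showing where postStartHook fits in.
class MyClusterScheduler(sc: SparkContext) extends TaskSchedulerImpl(sc) {
  override def postStartHook(): Unit = {
    // TaskSchedulerImpl's default behaviour: wait for the SchedulerBackend to be ready
    super.postStartHook()
    logInfo("Cluster-specific post-start initialization done")
  }
}
----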

                                                                                                                                                                                                                                                                    == [[applicationId]][[appId]] Unique Identifier of Spark Application

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskScheduler/#source-scala_15","title":"[source, scala]","text":""},{"location":"scheduler/TaskScheduler/#applicationid-string","title":"applicationId(): String","text":"

                                                                                                                                                                                                                                                                    applicationId is the unique identifier of the Spark application and defaults to spark-application-[currentTimeMillis].
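In other words, the default boils down to the following (a sketch):

[source, scala]
----
// Sketch of the default applicationId described above
def applicationId(): String = "spark-application-" + System.currentTimeMillis
----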

                                                                                                                                                                                                                                                                    applicationId is used when SparkContext is created.

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskSchedulerImpl/","title":"TaskSchedulerImpl","text":"

                                                                                                                                                                                                                                                                    TaskSchedulerImpl is a TaskScheduler that uses a SchedulerBackend to schedule tasks (for execution on a cluster manager).

When a Spark application starts (and so an instance of SparkContext is created), a TaskSchedulerImpl is created together with a SchedulerBackend and a DAGScheduler, and they are soon started.

                                                                                                                                                                                                                                                                    TaskSchedulerImpl generates tasks based on executor resource offers.

TaskSchedulerImpl can track racks per host and port (which, however, is only used with the Hadoop YARN cluster manager).

                                                                                                                                                                                                                                                                    Using spark.scheduler.mode configuration property you can select the scheduling policy.
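For example, to use FAIR scheduling instead of the default FIFO:

[source, scala]
----
import org.apache.spark.SparkConf

// Select the FAIR scheduling policy (default: FIFO).
val conf = new SparkConf()
  .setAppName("fair-scheduling")
  .set("spark.scheduler.mode", "FAIR")
----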

                                                                                                                                                                                                                                                                    TaskSchedulerImpl submits tasks using SchedulableBuilders.

                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskSchedulerImpl/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                    TaskSchedulerImpl takes the following to be created:

                                                                                                                                                                                                                                                                    • SparkContext
                                                                                                                                                                                                                                                                    • Maximum Number of Task Failures
                                                                                                                                                                                                                                                                    • isLocal flag (default: false)
                                                                                                                                                                                                                                                                    • Clock (default: SystemClock)

                                                                                                                                                                                                                                                                      While being created, TaskSchedulerImpl sets schedulingMode to the value of spark.scheduler.mode configuration property.

                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                      schedulingMode is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                      TaskSchedulerImpl throws a SparkException for unrecognized scheduling mode:

                                                                                                                                                                                                                                                                      Unrecognized spark.scheduler.mode: [schedulingModeConf]\n

                                                                                                                                                                                                                                                                      In the end, TaskSchedulerImpl creates a TaskResultGetter.

TaskSchedulerImpl is created when:

                                                                                                                                                                                                                                                                      • SparkContext is requested for a TaskScheduler (for local and spark master URLs)
                                                                                                                                                                                                                                                                      • KubernetesClusterManager and MesosClusterManager are requested for a TaskScheduler
                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSchedulerImpl/#maxTaskFailures","title":"Maximum Number of Task Failures","text":"

                                                                                                                                                                                                                                                                      TaskSchedulerImpl can be given the maximum number of task failures when created or default to spark.task.maxFailures configuration property.

                                                                                                                                                                                                                                                                      The number of task failures is used when submitting tasks (to create a TaskSetManager).

                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSchedulerImpl/#sparktaskcpus","title":"spark.task.cpus

                                                                                                                                                                                                                                                                      TaskSchedulerImpl uses spark.task.cpus configuration property for...FIXME

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#backend","title":"SchedulerBackend
                                                                                                                                                                                                                                                                      backend: SchedulerBackend\n

                                                                                                                                                                                                                                                                      TaskSchedulerImpl is given a SchedulerBackend when requested to initialize.

                                                                                                                                                                                                                                                                      The lifecycle of the SchedulerBackend is tightly coupled to the lifecycle of the TaskSchedulerImpl:

• It is started when TaskSchedulerImpl is started
• It is stopped when TaskSchedulerImpl is stopped

                                                                                                                                                                                                                                                                      TaskSchedulerImpl waits until the SchedulerBackend is ready before requesting it for the following:

                                                                                                                                                                                                                                                                      • Reviving resource offers when requested to submitTasks, statusUpdate, handleFailedTask, checkSpeculatableTasks, and executorLost

                                                                                                                                                                                                                                                                      • Killing tasks when requested to killTaskAttempt and killAllTaskAttempts

                                                                                                                                                                                                                                                                      • Default parallelism, applicationId and applicationAttemptId when requested for the defaultParallelism, applicationId and applicationAttemptId, respectively

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#unique-identifier-of-spark-application","title":"Unique Identifier of Spark Application
                                                                                                                                                                                                                                                                      applicationId(): String\n

                                                                                                                                                                                                                                                                      applicationId is part of the TaskScheduler abstraction.

applicationId simply requests the SchedulerBackend for the applicationId.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#cancelling-all-tasks-of-stage","title":"Cancelling All Tasks of Stage
                                                                                                                                                                                                                                                                      cancelTasks(\n  stageId: Int,\n  interruptThread: Boolean): Unit\n

                                                                                                                                                                                                                                                                      cancelTasks is part of the TaskScheduler abstraction.

cancelTasks cancels all the tasks submitted for execution in the given stage (stageId).

                                                                                                                                                                                                                                                                      cancelTasks is used when:

                                                                                                                                                                                                                                                                      • DAGScheduler is requested to failJobAndIndependentStages
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handlesuccessfultask","title":"handleSuccessfulTask
                                                                                                                                                                                                                                                                      handleSuccessfulTask(\n  taskSetManager: TaskSetManager,\n  tid: Long,\n  taskResult: DirectTaskResult[_]): Unit\n

                                                                                                                                                                                                                                                                      handleSuccessfulTask requests the given TaskSetManager to handleSuccessfulTask (with the given tid and taskResult).

                                                                                                                                                                                                                                                                      handleSuccessfulTask is used when:

                                                                                                                                                                                                                                                                      • TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handletaskgettingresult","title":"handleTaskGettingResult
                                                                                                                                                                                                                                                                      handleTaskGettingResult(\n  taskSetManager: TaskSetManager,\n  tid: Long): Unit\n

                                                                                                                                                                                                                                                                      handleTaskGettingResult requests the given TaskSetManager to handleTaskGettingResult.

                                                                                                                                                                                                                                                                      handleTaskGettingResult is used when:

                                                                                                                                                                                                                                                                      • TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                      initialize(\n  backend: SchedulerBackend): Unit\n

                                                                                                                                                                                                                                                                      initialize initializes the TaskSchedulerImpl with the given SchedulerBackend.

                                                                                                                                                                                                                                                                      initialize saves the given SchedulerBackend.

initialize then sets the root pool (rootPool) as an empty-named Pool (with the scheduling mode, and initMinShare and initWeight as 0).

NOTE: rootPool and schedulingMode are a part of the TaskScheduler Contract.

initialize sets the SchedulableBuilder (based on the scheduling mode):

• FIFOSchedulableBuilder for FIFO scheduling mode
• FairSchedulableBuilder for FAIR scheduling mode

initialize then requests the SchedulableBuilder to build pools.

CAUTION: FIXME Why are rootPool and schedulableBuilder created only now? What do they need that is not available when TaskSchedulerImpl is created?

NOTE: initialize is called while SparkContext is created (and creates the SchedulerBackend and the TaskScheduler).
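
The following self-contained toy model (not Spark's source; every type below is a simplified stand-in) illustrates the wiring just described: an empty-named root pool plus a SchedulableBuilder chosen by the scheduling mode:

object SchedulingMode extends Enumeration { val FIFO, FAIR = Value }

class Pool(val name: String, val mode: SchedulingMode.Value,
           val initMinShare: Int, val initWeight: Int)

trait SchedulableBuilder { def buildPools(): Unit }

class FIFOSchedulableBuilder(rootPool: Pool) extends SchedulableBuilder {
  def buildPools(): Unit = ()   // FIFO needs no extra pools
}

class FairSchedulableBuilder(rootPool: Pool) extends SchedulableBuilder {
  def buildPools(): Unit = ()   // the real builder would read fairscheduler.xml here
}

class ToyTaskScheduler(schedulingMode: SchedulingMode.Value) {
  // empty-named root pool with initMinShare and initWeight of 0
  val rootPool = new Pool("", schedulingMode, 0, 0)
  var schedulableBuilder: SchedulableBuilder = _

  def initialize(): Unit = {
    schedulableBuilder = schedulingMode match {
      case SchedulingMode.FIFO => new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR => new FairSchedulableBuilder(rootPool)
    }
    schedulableBuilder.buildPools()
  }
}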

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#starting-taskschedulerimpl","title":"Starting TaskSchedulerImpl
                                                                                                                                                                                                                                                                      start(): Unit\n

                                                                                                                                                                                                                                                                      start starts the SchedulerBackend and the task-scheduler-speculation executor service.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handling-task-status-update","title":"Handling Task Status Update
                                                                                                                                                                                                                                                                      statusUpdate(\n  tid: Long,\n  state: TaskState,\n  serializedData: ByteBuffer): Unit\n

statusUpdate finds the TaskSetManager for the input tid task (in the taskIdToTaskSetManager internal registry).

                                                                                                                                                                                                                                                                      When state is LOST, statusUpdate...FIXME

                                                                                                                                                                                                                                                                      NOTE: TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode.

When state is one of the finished states, i.e. FINISHED, FAILED, KILLED or LOST, statusUpdate cleans up the internal state for the input tid (removing it from the internal registries).

statusUpdate requests the TaskSetManager to unregister tid from the running tasks.

statusUpdate requests the TaskResultGetter to schedule an asynchronous task to deserialize the task result (and notify TaskSchedulerImpl back) for tid in FINISHED state, and to schedule an asynchronous task to deserialize the TaskFailedReason (and notify TaskSchedulerImpl back) for tid in the other finished states (i.e. FAILED, KILLED, LOST).

If a task is in LOST state, statusUpdate notifies the DAGScheduler that the executor was lost (with SlaveLost and the reason Task [tid] was lost, so marking the executor as lost as well.) and requests the SchedulerBackend to revive offers.

In case the TaskSetManager for tid could not be found (in the taskIdToTaskSetManager registry), you should see the following ERROR message in the logs:

                                                                                                                                                                                                                                                                      Ignoring update with state [state] for TID [tid] because its task set is gone (this is likely the result of receiving duplicate task finished status updates)\n

Any exception is caught and reported as an ERROR message in the logs:

                                                                                                                                                                                                                                                                      Exception in statusUpdate\n

                                                                                                                                                                                                                                                                      CAUTION: FIXME image with scheduler backends calling TaskSchedulerImpl.statusUpdate.
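
As a rough, self-contained sketch of the dispatch above (toy types only, with println in place of the real TaskResultGetter calls; not Spark's code):

object TaskState extends Enumeration {
  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value
  def isFinished(s: Value): Boolean = Set(FINISHED, FAILED, KILLED, LOST)(s)
}

def statusUpdate(tid: Long, state: TaskState.Value,
                 taskIdToManager: scala.collection.mutable.Map[Long, String]): Unit =
  taskIdToManager.get(tid) match {
    case Some(manager) if TaskState.isFinished(state) =>
      taskIdToManager.remove(tid)                             // clean up the registry
      if (state == TaskState.FINISHED)
        println(s"enqueueSuccessfulTask($manager, $tid)")     // deserialize result asynchronously
      else
        println(s"enqueueFailedTask($manager, $tid, $state)") // deserialize failure reason asynchronously
    case Some(_) => ()                                        // task still running
    case None =>
      println(s"Ignoring update with state $state for TID $tid because its task set is gone")
  }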

                                                                                                                                                                                                                                                                      statusUpdate is used when:

                                                                                                                                                                                                                                                                      • DriverEndpoint (of CoarseGrainedSchedulerBackend) is requested to handle a StatusUpdate message

                                                                                                                                                                                                                                                                      • LocalEndpoint is requested to handle a StatusUpdate message

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#task-scheduler-speculation-scheduled-executor-service","title":"task-scheduler-speculation Scheduled Executor Service

                                                                                                                                                                                                                                                                      speculationScheduler is a java.util.concurrent.ScheduledExecutorService with the name task-scheduler-speculation for Speculative Execution of Tasks.

                                                                                                                                                                                                                                                                      When TaskSchedulerImpl is requested to start (in non-local run mode) with spark.speculation enabled, speculationScheduler is used to schedule checkSpeculatableTasks to execute periodically every spark.speculation.interval.

                                                                                                                                                                                                                                                                      speculationScheduler is shut down when TaskSchedulerImpl is requested to stop.
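
A minimal, standalone sketch of that scheduling pattern (the names and the 100 ms delay below are illustrative only; Spark derives the real interval from spark.speculation.interval):

import java.util.concurrent.{Executors, TimeUnit}

val speculationScheduler = Executors.newSingleThreadScheduledExecutor()

def checkSpeculatableTasks(): Unit =
  println("checking for speculatable tasks...")   // would revive offers if any were found

speculationScheduler.scheduleWithFixedDelay(
  () => checkSpeculatableTasks(),                 // Runnable via SAM conversion (Scala 2.12+)
  100, 100, TimeUnit.MILLISECONDS)                // initial delay, then interval

// later, on stop():
// speculationScheduler.shutdown()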

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#checking-for-speculatable-tasks","title":"Checking for Speculatable Tasks
                                                                                                                                                                                                                                                                      checkSpeculatableTasks(): Unit\n

checkSpeculatableTasks requests the rootPool to check for speculatable tasks (those that have run for more than 100 ms) and, if there are any, requests the SchedulerBackend to revive offers.

NOTE: checkSpeculatableTasks is executed periodically as part of Speculative Execution of Tasks.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#cleaning-up-after-removing-executor","title":"Cleaning up After Removing Executor
                                                                                                                                                                                                                                                                      removeExecutor(\n  executorId: String,\n  reason: ExecutorLossReason): Unit\n

removeExecutor removes the executorId executor from the internal registries (incl. executorIdToHost, executorsByHost, and hostsByRack). If the affected hosts and racks are the last entries in executorsByHost and hostsByRack, respectively, they are removed from the registries.

Unless reason is LossReasonPending, the executor is removed from the executorIdToHost registry and the TaskSetManagers get notified (executorLost).
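
A toy model of that registry bookkeeping (plain Scala collections, not the actual TaskSchedulerImpl fields): drop the executor from the per-host set and remove the host entry once its last executor is gone.

import scala.collection.mutable

val executorIdToHost = mutable.Map("exec-1" -> "host-a", "exec-2" -> "host-a")
val executorsByHost  = mutable.Map("host-a" -> mutable.Set("exec-1", "exec-2"))

def removeExecutor(executorId: String): Unit =
  executorIdToHost.remove(executorId).foreach { host =>
    executorsByHost.get(host).foreach { execs =>
      execs -= executorId
      if (execs.isEmpty) executorsByHost -= host   // last executor on the host
    }
  }

removeExecutor("exec-1")   // host-a still present (exec-2 remains)
removeExecutor("exec-2")   // host-a removed from executorsByHost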

NOTE: The internal removeExecutor is called as part of handling a status update of a task in LOST state and executorLost.","text":""},{"location":"scheduler/TaskSchedulerImpl/#handling-nearly-completed-sparkcontext-initialization","title":"Handling Nearly-Completed SparkContext Initialization

                                                                                                                                                                                                                                                                      postStartHook(): Unit\n

                                                                                                                                                                                                                                                                      postStartHook is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                      postStartHook waits until a scheduler backend is ready.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#waiting-until-schedulerbackend-is-ready","title":"Waiting Until SchedulerBackend is Ready
                                                                                                                                                                                                                                                                      waitBackendReady(): Unit\n

waitBackendReady waits until the SchedulerBackend is ready. If it is, waitBackendReady returns immediately. Otherwise, waitBackendReady keeps checking every 100 milliseconds (hardcoded) until the SchedulerBackend is ready or the SparkContext is stopped.

                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                      A SchedulerBackend is ready by default.

                                                                                                                                                                                                                                                                      If the SparkContext happens to be stopped while waiting, waitBackendReady throws an IllegalStateException:

                                                                                                                                                                                                                                                                      Spark context stopped while waiting for backend\n
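
The waiting loop can be sketched as follows (a standalone outline of the behaviour described above, with the readiness and stopped checks passed in as functions; not the actual Spark code):

def waitUntilReady(isReady: () => Boolean, isStopped: () => Boolean): Unit =
  while (!isReady()) {
    if (isStopped())
      throw new IllegalStateException("Spark context stopped while waiting for backend")
    Thread.sleep(100)   // hardcoded 100 ms between checks, as described above
  }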
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#stopping-taskschedulerimpl","title":"Stopping TaskSchedulerImpl
                                                                                                                                                                                                                                                                      stop(): Unit\n

stop stops all the internal services, i.e. the task-scheduler-speculation executor service, the SchedulerBackend, the TaskResultGetter, and the starvation timer.","text":""},{"location":"scheduler/TaskSchedulerImpl/#default-level-of-parallelism","title":"Default Level of Parallelism

                                                                                                                                                                                                                                                                      defaultParallelism(): Int\n

                                                                                                                                                                                                                                                                      defaultParallelism is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                      defaultParallelism requests the SchedulerBackend for the default level of parallelism.

                                                                                                                                                                                                                                                                      Note

The default level of parallelism is a hint for sizing jobs; SparkContext uses it to create RDDs with the right number of partitions when not specified explicitly.
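
For example (assuming an active SparkContext sc), the parallelize operator falls back on this hint when no partition count is given:

// Illustrative only; the exact values depend on the master URL and configuration.
val rdd1 = sc.parallelize(1 to 100)        // sc.defaultParallelism partitions
val rdd2 = sc.parallelize(1 to 100, 10)    // explicit: 10 partitions
println(sc.defaultParallelism)
println(rdd1.getNumPartitions)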

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#submitting-tasks-of-taskset-for-execution","title":"Submitting Tasks (of TaskSet) for Execution
                                                                                                                                                                                                                                                                      submitTasks(\n  taskSet: TaskSet): Unit\n

                                                                                                                                                                                                                                                                      submitTasks is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                      In essence, submitTasks registers a new TaskSetManager (for the given TaskSet) and requests the SchedulerBackend to handle resource allocation offers (from the scheduling system).

                                                                                                                                                                                                                                                                      Internally, submitTasks prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                      Adding task set [id] with [length] tasks\n

submitTasks then creates a TaskSetManager (for the given TaskSet and the maximum number of task failures).

submitTasks registers (adds) the TaskSetManager per the stage and stage attempt IDs (of the TaskSet) in the taskSetsByStageIdAndAttempt internal registry.

NOTE: The taskSetsByStageIdAndAttempt internal registry tracks the TaskSetManagers (that represent TaskSets) per stage and stage attempt. In other words, there could be many TaskSetManagers for a single stage, each representing a unique stage attempt.

NOTE: Not only could a task be retried, but also a whole stage.

submitTasks makes sure that the new TaskSetManager is the only active (non-zombie) one for the stage across all its attempts. Otherwise, submitTasks throws an IllegalStateException:

                                                                                                                                                                                                                                                                      more than one active taskSet for stage [stage]: [TaskSet ids]\n

                                                                                                                                                                                                                                                                      NOTE: TaskSetManager is considered active when it is not a zombie.

submitTasks requests the SchedulableBuilder to add the TaskSetManager to the schedulable pool.

NOTE: The schedulable pool can be a single flat linked queue (in FIFO scheduling mode) or a hierarchy of pools of Schedulables (in FAIR scheduling mode).

submitTasks schedules a starvation timer (in a non-local run mode) to make sure that the requested resources (i.e. CPU and memory) are assigned to the Spark application (the very first time the Spark application is started, per the hasReceivedTask flag).

NOTE: The very first time (when the hasReceivedTask flag is false) in cluster mode only (i.e. isLocal of the TaskSchedulerImpl is false), starvationTimer is scheduled to execute after spark.starvation.timeout to ensure that the requested resources, i.e. CPUs and memory, were assigned by a cluster manager.

NOTE: After the first spark.starvation.timeout passes, the hasReceivedTask internal flag is true.

In the end, submitTasks requests the SchedulerBackend to reviveOffers.

TIP: Use the dag-scheduler-event-loop thread to step through the code in a debugger.
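
The registration and the exactly-one-active-TaskSetManager check can be sketched with plain collections (a toy model, not Spark's code; ToyManager stands in for TaskSetManager):

import scala.collection.mutable

case class ToyManager(stageId: Int, attemptId: Int, var isZombie: Boolean = false)

val taskSetsByStageIdAndAttempt =
  mutable.Map.empty[Int, mutable.Map[Int, ToyManager]]

def submitTasks(stageId: Int, attemptId: Int): ToyManager = {
  val manager  = ToyManager(stageId, attemptId)
  val attempts = taskSetsByStageIdAndAttempt.getOrElseUpdate(stageId, mutable.Map.empty)
  attempts(attemptId) = manager
  // reject a second active (non-zombie) manager for the same stage
  val conflicting = attempts.values.filter(m => !m.isZombie && m.ne(manager))
  if (conflicting.nonEmpty)
    throw new IllegalStateException(
      s"more than one active taskSet for stage $stageId: ${attempts.keys.mkString(",")}")
  // then: schedulableBuilder.addTaskSetManager(manager, ...); backend.reviveOffers()
  manager
}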

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#scheduling-starvation-task","title":"Scheduling Starvation Task

                                                                                                                                                                                                                                                                      Every time the starvation timer thread is executed and hasLaunchedTask flag is false, the following WARN message is printed out to the logs:

                                                                                                                                                                                                                                                                      Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\n

Otherwise, when the hasLaunchedTask flag is true, the timer thread cancels itself.
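
A bare-bones sketch of this warn-until-first-task pattern with java.util.Timer (the 15-second period merely stands in for spark.starvation.timeout; not the actual Spark code):

import java.util.{Timer, TimerTask}

@volatile var hasLaunchedTask = false
val starvationTimer = new Timer("task-starvation-timer", /* isDaemon = */ true)

starvationTimer.scheduleAtFixedRate(new TimerTask {
  override def run(): Unit =
    if (!hasLaunchedTask)
      println("Initial job has not accepted any resources; check your cluster UI " +
        "to ensure that workers are registered and have sufficient resources")
    else this.cancel()   // first task launched: stop warning
}, 15000, 15000)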

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#createTaskSetManager","title":"Creating TaskSetManager
                                                                                                                                                                                                                                                                      createTaskSetManager(\n  taskSet: TaskSet,\n  maxTaskFailures: Int): TaskSetManager\n

                                                                                                                                                                                                                                                                      createTaskSetManager creates a TaskSetManager (with this TaskSchedulerImpl, the given TaskSet and the maxTaskFailures).

                                                                                                                                                                                                                                                                      createTaskSetManager is used when:

                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to submit a TaskSet
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#notifying-tasksetmanager-that-task-failed","title":"Notifying TaskSetManager that Task Failed
                                                                                                                                                                                                                                                                      handleFailedTask(\n  taskSetManager: TaskSetManager,\n  tid: Long,\n  taskState: TaskState,\n  reason: TaskFailedReason): Unit\n

handleFailedTask notifies the given taskSetManager that the tid task has failed and, only when the taskSetManager is not in zombie state and the task is not in KILLED state, requests the SchedulerBackend to revive offers.

NOTE: handleFailedTask is called when TaskResultGetter has deserialized a TaskFailedReason for a failed task.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#tasksetfinished","title":"taskSetFinished
                                                                                                                                                                                                                                                                      taskSetFinished(\n  manager: TaskSetManager): Unit\n

taskSetFinished looks up all the TaskSets by the stage ID (in the taskSetsByStageIdAndAttempt registry) and removes the stage attempt from them, possibly removing the entire stage record from the taskSetsByStageIdAndAttempt registry completely (if there are no other attempts registered).

                                                                                                                                                                                                                                                                      taskSetFinished then removes manager from the parent's schedulable pool.

                                                                                                                                                                                                                                                                      You should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                      Removed TaskSet [id], whose tasks have all completed, from pool [name]\n
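
Continuing the toy registry from the submitTasks sketch above, the cleanup can be modelled as (not Spark's code):

def taskSetFinished(stageId: Int, attemptId: Int): Unit =
  taskSetsByStageIdAndAttempt.get(stageId).foreach { attempts =>
    attempts.remove(attemptId)                                  // drop the finished attempt
    if (attempts.isEmpty) taskSetsByStageIdAndAttempt.remove(stageId)  // no attempts left
    // then: remove the manager from its parent pool and print the INFO message above
  }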

                                                                                                                                                                                                                                                                      taskSetFinished is used when:

                                                                                                                                                                                                                                                                      • TaskSetManager is requested to maybeFinishTaskSet
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#notifying-dagscheduler-about-new-executor","title":"Notifying DAGScheduler About New Executor
                                                                                                                                                                                                                                                                      executorAdded(\n  execId: String,\n  host: String)\n

executorAdded just notifies the DAGScheduler that an executor was added.

NOTE: executorAdded uses the DAGScheduler that was assigned earlier (when TaskSchedulerImpl was requested to set the DAGScheduler).","text":""},{"location":"scheduler/TaskSchedulerImpl/#resourceOffers","title":"Creating TaskDescriptions For Available Executor Resource Offers

                                                                                                                                                                                                                                                                      resourceOffers(\n  offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]]\n

                                                                                                                                                                                                                                                                      resourceOffers takes the resources offers and generates a collection of tasks (as TaskDescriptions) to launch (given the resources available).

                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                      A WorkerOffer represents a resource offer with CPU cores free to use on an executor.

Internally, resourceOffers first updates the internal executor and host lookup registries (e.g. executorIdToHost and executorsByHost) to record new hosts and executors (given the input offers).

For new executors (not registered yet), resourceOffers notifies the DAGScheduler that an executor was added.

                                                                                                                                                                                                                                                                      NOTE: TaskSchedulerImpl uses resourceOffers to track active executors.

                                                                                                                                                                                                                                                                      CAUTION: FIXME a picture with executorAdded call from TaskSchedulerImpl to DAGScheduler.

                                                                                                                                                                                                                                                                      resourceOffers requests BlacklistTracker to applyBlacklistTimeout and filters out offers on blacklisted nodes and executors.

NOTE: resourceOffers uses the optional BlacklistTracker that was given when TaskSchedulerImpl was created.

                                                                                                                                                                                                                                                                      CAUTION: FIXME Expand on blacklisting

                                                                                                                                                                                                                                                                      resourceOffers then randomly shuffles offers (to evenly distribute tasks across executors and avoid over-utilizing some executors) and initializes the local data structures tasks and availableCpus (as shown in the figure below).

resourceOffers takes the TaskSets in scheduling order (getSortedTaskSetQueue) from the top-level Schedulable Pool (rootPool).

                                                                                                                                                                                                                                                                      Note

rootPool is configured when TaskSchedulerImpl is initialized.

rootPool is part of the TaskScheduler Contract and exclusively managed by SchedulableBuilders, i.e. FIFOSchedulableBuilder and FairSchedulableBuilder (that manage registering TaskSetManagers with the root pool).

TaskSetManager manages execution of the tasks in a single TaskSet (that represents a single Stage).

                                                                                                                                                                                                                                                                      For every TaskSetManager (in scheduling order), you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                      parentName: [name], name: [name], runningTasks: [count]\n

Only when a new executor was added does resourceOffers scheduler:TaskSetManager.md#executorAdded[notify every TaskSetManager about the change] (so that locality preferences can be recomputed).

resourceOffers then takes every TaskSetManager (in scheduling order) and offers it resources on each node, in increasing order of locality levels (per the scheduler:TaskSetManager.md#computeValidLocalityLevels[TaskSetManager's valid locality levels]).

                                                                                                                                                                                                                                                                      NOTE: A TaskSetManager scheduler:TaskSetManager.md#computeValidLocalityLevels[computes locality levels of the tasks] it manages.

                                                                                                                                                                                                                                                                      For every TaskSetManager and the TaskSetManager's valid locality level, resourceOffers tries to <> as long as the TaskSetManager manages to launch a task (given the locality level).

                                                                                                                                                                                                                                                                      If resourceOffers did not manage to offer resources to a TaskSetManager so it could launch any task, resourceOffers scheduler:TaskSetManager.md#abortIfCompletelyBlacklisted[requests the TaskSetManager to abort the TaskSet if completely blacklisted].

When resourceOffers manages to launch a task, the internal <> flag is enabled (which means exactly what the name says: there were executors and a task was launched).
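Putting the steps above together, the overall control flow can be sketched in a few lines of Scala. This is a REPL-style sketch over hypothetical Offer and TaskSetLike stand-ins (with a made-up tryLaunch callback), not the actual TaskSchedulerImpl code:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical, simplified stand-ins for WorkerOffer and TaskSetManager; not the Spark types.
case class Offer(executorId: String, host: String, cores: Int)

trait TaskSetLike {
  def validLocalityLevels: Seq[String]                    // e.g. PROCESS_LOCAL .. ANY
  def tryLaunch(offer: Offer, locality: String): Boolean  // true when a task was launched
}

// Sketch of the loop: shuffle the offers, then walk the sorted TaskSets and offer each of them
// resources at increasing locality levels for as long as tasks keep getting launched.
def resourceOffersSketch(offers: Seq[Offer], sortedTaskSets: Seq[TaskSetLike]): Boolean = {
  val shuffled = Random.shuffle(offers)                   // spread tasks across executors
  val availableCpus = ArrayBuffer(shuffled.map(_.cores): _*)
  var launchedAnyTask = false
  for (taskSet <- sortedTaskSets; locality <- taskSet.validLocalityLevels) {
    var launched = true
    while (launched) {                                    // keep offering while tasks launch
      launched = shuffled.zipWithIndex.exists { case (offer, i) =>
        val ok = availableCpus(i) > 0 && taskSet.tryLaunch(offer, locality)
        if (ok) { availableCpus(i) -= 1; launchedAnyTask = true }
        ok
      }
    }
  }
  launchedAnyTask
}
```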

                                                                                                                                                                                                                                                                      resourceOffers is used when:

                                                                                                                                                                                                                                                                      • CoarseGrainedSchedulerBackend (via DriverEndpoint RPC endpoint) is requested to make executor resource offers
                                                                                                                                                                                                                                                                      • LocalEndpoint is requested to revive resource offers
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#maybeinitbarriercoordinator","title":"maybeInitBarrierCoordinator
                                                                                                                                                                                                                                                                      maybeInitBarrierCoordinator(): Unit\n

                                                                                                                                                                                                                                                                      Unless a BarrierCoordinator has already been registered, maybeInitBarrierCoordinator creates a BarrierCoordinator and registers it to be known as barrierSync.

                                                                                                                                                                                                                                                                      In the end, maybeInitBarrierCoordinator prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                      Registered BarrierCoordinator endpoint\n
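A minimal "register once" sketch of the idea, with a hypothetical Endpoint class and a register callback standing in for the RPC machinery; not the actual Spark code:

```scala
// Hypothetical stand-in for an RPC endpoint.
class Endpoint(val name: String)

object BarrierRegistrationSketch {
  @volatile private var barrierCoordinator: Option[Endpoint] = None

  def maybeInitBarrierCoordinator(register: Endpoint => Unit): Unit = synchronized {
    if (barrierCoordinator.isEmpty) {
      val endpoint = new Endpoint("barrierSync")
      register(endpoint)                                  // the real code registers an RPC endpoint here
      barrierCoordinator = Some(endpoint)
      println("Registered BarrierCoordinator endpoint")   // logged at INFO in the real code
    }
  }
}
```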
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#resourceOfferSingleTaskSet","title":"Finding Tasks from TaskSetManager to Schedule on Executors
                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet(\n  taskSet: TaskSetManager,\n  maxLocality: TaskLocality,\n  shuffledOffers: Seq[WorkerOffer],\n  availableCpus: Array[Int],\n  availableResources: Array[Map[String, Buffer[String]]],\n  tasks: IndexedSeq[ArrayBuffer[TaskDescription]]): (Boolean, Option[TaskLocality])\n

resourceOfferSingleTaskSet takes every WorkerOffer (from the input shuffledOffers) and, only when the offer has at least configuration-properties.md#spark.task.cpus[spark.task.cpus] CPU cores available (per the input availableCpus), scheduler:TaskSetManager.md#resourceOffer[requests the TaskSetManager (the input taskSet) to find a Task to execute for the resource offer] (given the offer's executor, host, and the input maxLocality).

                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet adds the task to the input tasks collection.

resourceOfferSingleTaskSet records the task ID and the TaskSetManager in the internal registries (taskIdToTaskSetManager and taskIdToExecutorId).

                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet decreases configuration-properties.md#spark.task.cpus[spark.task.cpus] from the input availableCpus (for the WorkerOffer).

                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet returns whether a task was launched or not.

                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet asserts that the number of available CPU cores (in the input availableCpus per WorkerOffer) is at least 0.

If a TaskNotSerializableException is thrown, resourceOfferSingleTaskSet prints out the following ERROR message in the logs:

                                                                                                                                                                                                                                                                      Resource offer failed, task set [name] was not serializable\n
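For illustration, the per-offer loop could be sketched like this. It is a REPL-style sketch with hypothetical Offer and TaskDesc stand-ins and a findTask callback in place of TaskSetManager.resourceOffer; the serialization-failure handling is omitted:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-ins; not the actual Spark types.
case class Offer(executorId: String, host: String)
case class TaskDesc(taskId: Long, executorId: String)

// Sketch of the per-offer loop: for every offer with enough free CPU cores,
// ask the TaskSet for a task, record it, and charge the CPUs to that offer.
def resourceOfferSingleTaskSetSketch(
    findTask: (String, String) => Option[TaskDesc],   // stands in for TaskSetManager.resourceOffer
    shuffledOffers: Seq[Offer],
    availableCpus: Array[Int],
    cpusPerTask: Int,                                  // stands in for spark.task.cpus
    tasks: IndexedSeq[ArrayBuffer[TaskDesc]]): Boolean = {
  var launchedTask = false
  for ((offer, i) <- shuffledOffers.zipWithIndex) {
    if (availableCpus(i) >= cpusPerTask) {
      findTask(offer.executorId, offer.host).foreach { task =>
        tasks(i) += task                 // collect the task description for this offer
        availableCpus(i) -= cpusPerTask  // charge the task's CPUs to the offer
        assert(availableCpus(i) >= 0)
        launchedTask = true
      }
    }
  }
  launchedTask
}
```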

                                                                                                                                                                                                                                                                      resourceOfferSingleTaskSet is used when:

                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to resourceOffers
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#TaskLocality","title":"Task Locality Preference

                                                                                                                                                                                                                                                                      TaskLocality represents a task locality preference and can be one of the following (from the most localized to the widest):

                                                                                                                                                                                                                                                                      1. PROCESS_LOCAL
                                                                                                                                                                                                                                                                      2. NODE_LOCAL
                                                                                                                                                                                                                                                                      3. NO_PREF
                                                                                                                                                                                                                                                                      4. RACK_LOCAL
                                                                                                                                                                                                                                                                      5. ANY
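The ordering matters: a task may only be scheduled under a scheduling constraint that is the same or wider than its own preference. Here is a sketch with a plain Scala Enumeration that mirrors the order above (an illustration, not Spark's own TaskLocality object):

```scala
object LocalitySketch {
  // Mirrors the ordering listed above; a sketch, not Spark's TaskLocality.
  object Locality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

    // A task whose best locality is `condition` may be scheduled under a
    // constraint `constraint` when the constraint is at least as wide.
    def isAllowed(constraint: Value, condition: Value): Boolean = condition <= constraint
  }

  def main(args: Array[String]): Unit = {
    import Locality._
    assert(isAllowed(ANY, NODE_LOCAL))            // a wider constraint accepts better locality
    assert(!isAllowed(PROCESS_LOCAL, RACK_LOCAL)) // a narrower constraint rejects worse locality
  }
}
```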
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#workeroffer-free-cpu-cores-on-executor","title":"WorkerOffer \u2014 Free CPU Cores on Executor
                                                                                                                                                                                                                                                                      WorkerOffer(\n  executorId: String,\n  host: String,\n  cores: Int)\n

                                                                                                                                                                                                                                                                      WorkerOffer represents a resource offer with free CPU cores available on an executorId executor on a host.
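As an illustration only (made-up executor IDs and hosts; recent Spark versions give WorkerOffer additional fields beyond these three):

```scala
// REPL-style sketch that mirrors the three-argument shape shown above.
case class WorkerOffer(executorId: String, host: String, cores: Int)

val offers = IndexedSeq(
  WorkerOffer("exec-1", "node-a.example.com", cores = 8),
  WorkerOffer("exec-2", "node-b.example.com", cores = 4))

// Total CPU cores currently on offer across all executors.
val totalCoresOnOffer = offers.map(_.cores).sum   // 12
```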

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#workerremoved","title":"workerRemoved
                                                                                                                                                                                                                                                                      workerRemoved(\n  workerId: String,\n  host: String,\n  message: String): Unit\n

                                                                                                                                                                                                                                                                      workerRemoved is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                      workerRemoved prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                      Handle removed worker [workerId]: [message]\n

                                                                                                                                                                                                                                                                      In the end, workerRemoved requests the DAGScheduler to workerRemoved.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#calculateAvailableSlots","title":"calculateAvailableSlots
                                                                                                                                                                                                                                                                      calculateAvailableSlots(\n  scheduler: TaskSchedulerImpl,\n  conf: SparkConf,\n  rpId: Int,\n  availableRPIds: Array[Int],\n  availableCpus: Array[Int],\n  availableResources: Array[Map[String, Int]]): Int\n

                                                                                                                                                                                                                                                                      calculateAvailableSlots...FIXME

                                                                                                                                                                                                                                                                      calculateAvailableSlots is used when:

                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested for TaskDescriptions for the given executor resource offers
                                                                                                                                                                                                                                                                      • CoarseGrainedSchedulerBackend is requested for the maximum number of concurrent tasks
                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSchedulerImpl/#logging","title":"Logging

                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.scheduler.TaskSchedulerImpl logger to see what happens inside.

                                                                                                                                                                                                                                                                      Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                      logger.TaskSchedulerImpl.name = org.apache.spark.scheduler.TaskSchedulerImpl\nlogger.TaskSchedulerImpl.level = all\n

                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskSet/","title":"TaskSet","text":"

                                                                                                                                                                                                                                                                      TaskSet is a collection of independent tasks of a stage (and a stage execution attempt) that are missing (uncomputed), i.e. for which computation results are unavailable (as RDD blocks on BlockManagers on executors).

                                                                                                                                                                                                                                                                      In other words, a TaskSet represents the missing partitions of a stage that (as tasks) can be run right away based on the data that is already on the cluster, e.g. map output files from previous stages, though they may fail if this data becomes unavailable.

Since a TaskSet contains only the missing tasks, its size does not have to equal the total number of tasks of the stage. For a brand new stage (one that has never been attempted) the two numbers are exactly the same.

                                                                                                                                                                                                                                                                      Once DAGScheduler submits the missing tasks for execution (to the TaskScheduler), the execution of the TaskSet is managed by a TaskSetManager that allows for spark.task.maxFailures.

                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                      TaskSet takes the following to be created:

                                                                                                                                                                                                                                                                      • Tasks
                                                                                                                                                                                                                                                                      • Stage ID
                                                                                                                                                                                                                                                                      • Stage (Execution) Attempt ID
                                                                                                                                                                                                                                                                      • FIFO Priority
                                                                                                                                                                                                                                                                      • Local Properties
                                                                                                                                                                                                                                                                      • Resource Profile ID

TaskSet is created when:

                                                                                                                                                                                                                                                                        • DAGScheduler is requested to submit the missing tasks of a stage
                                                                                                                                                                                                                                                                        "},{"location":"scheduler/TaskSet/#id","title":"ID
                                                                                                                                                                                                                                                                        id: String\n

TaskSet is uniquely identified by an id that uses the stageId followed by the stageAttemptId with a dot (.) in-between:

                                                                                                                                                                                                                                                                        [stageId].[stageAttemptId]\n
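A trivial sketch of the format (a hypothetical helper, not the Spark class):

```scala
// REPL-style sketch of the id format.
def taskSetId(stageId: Int, stageAttemptId: Int): String = s"$stageId.$stageAttemptId"

taskSetId(3, 0)   // "3.0"
```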
                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/TaskSet/#textual-representation","title":"Textual Representation
                                                                                                                                                                                                                                                                        toString: String\n

                                                                                                                                                                                                                                                                        toString follows the pattern:

                                                                                                                                                                                                                                                                        TaskSet [stageId].[stageAttemptId]\n
                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/TaskSet/#task-scheduling-prioritization-fifo-scheduling","title":"Task Scheduling Prioritization (FIFO Scheduling)

                                                                                                                                                                                                                                                                        TaskSet is given a priority when created.

                                                                                                                                                                                                                                                                        The priority is the ID of the earliest-created active job that needs the stage (that is given when DAGScheduler is requested to submit the missing tasks of a stage).

Once the TaskSet is submitted for execution, this priority becomes the priority of the TaskSetManager (a Schedulable) and is used to prioritize scheduling of tasks in the FIFO scheduling mode.
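The following REPL-style sketch illustrates the idea of FIFO prioritization with a hypothetical SchedulableInfo stand-in (lower job ID first, ties broken by lower stage ID); it mirrors the intent of Spark's FIFO scheduling algorithm but is not that code:

```scala
// Hypothetical stand-in carrying the two values used for FIFO ordering.
case class SchedulableInfo(name: String, priority: Int, stageId: Int)

// Lower job ID (priority) wins; ties are broken by lower stage ID.
def fifoLessThan(s1: SchedulableInfo, s2: SchedulableInfo): Boolean =
  if (s1.priority != s2.priority) s1.priority < s2.priority
  else s1.stageId < s2.stageId

val a = SchedulableInfo("TaskSet 3.0", priority = 0, stageId = 3)
val b = SchedulableInfo("TaskSet 5.0", priority = 1, stageId = 5)
fifoLessThan(a, b)   // true: the task set of the earlier job is scheduled first
```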

                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/TaskSetBlacklist/","title":"TaskSetBlacklist","text":"

                                                                                                                                                                                                                                                                        == [[TaskSetBlacklist]] TaskSetBlacklist -- Blacklisting Executors and Nodes For TaskSet

                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                        === [[updateBlacklistForFailedTask]] updateBlacklistForFailedTask Method

                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                        === [[isExecutorBlacklistedForTaskSet]] isExecutorBlacklistedForTaskSet Method

                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                        === [[isNodeBlacklistedForTaskSet]] isNodeBlacklistedForTaskSet Method

                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                        "},{"location":"scheduler/TaskSetManager/","title":"TaskSetManager","text":"

                                                                                                                                                                                                                                                                        TaskSetManager is a Schedulable that manages scheduling the tasks of a TaskSet.

                                                                                                                                                                                                                                                                        "},{"location":"scheduler/TaskSetManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                        TaskSetManager takes the following to be created:

                                                                                                                                                                                                                                                                        • TaskSchedulerImpl
                                                                                                                                                                                                                                                                        • TaskSet
                                                                                                                                                                                                                                                                        • Number of Task Failures
                                                                                                                                                                                                                                                                        • HealthTracker
                                                                                                                                                                                                                                                                        • Clock

                                                                                                                                                                                                                                                                          TaskSetManager is created when:

                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to create a TaskSetManager

                                                                                                                                                                                                                                                                          While being created, TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks in the taskset.

                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                          TaskSetManager uses TaskSchedulerImpl to access the current MapOutputTracker.

TaskSetManager prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                          Epoch for [taskSet]: [epoch]\n

                                                                                                                                                                                                                                                                          TaskSetManager adds the tasks as pending execution (in reverse order from the highest partition to the lowest).
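The two constructor-time steps can be sketched as follows. This is a REPL-style sketch with a hypothetical SketchTask class and plain collections in place of the real registries:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a task that carries a map-output epoch.
class SketchTask(val partitionId: Int) { var epoch: Long = -1 }

val currentEpoch = 7L                                    // would come from MapOutputTracker
val tasks = Array.tabulate(4)(i => new SketchTask(i))    // partitions 0..3
tasks.foreach(_.epoch = currentEpoch)                    // propagate the epoch to every task

val pendingTaskIndices = ArrayBuffer.empty[Int]
// Add in reverse order so lower-numbered partitions tend to be launched first
// (pending tasks are typically dequeued from the end of the buffer).
for (i <- (0 until tasks.length).reverse) {
  pendingTaskIndices += i
}
pendingTaskIndices   // ArrayBuffer(3, 2, 1, 0)
```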

                                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskSetManager/#number-of-task-failures","title":"Number of Task Failures

                                                                                                                                                                                                                                                                          TaskSetManager is given maxTaskFailures value that is how many times a single task can fail before the whole TaskSet is aborted.

| Master URL | Number of Task Failures |
|------------|-------------------------|
| local | 1 |
| local-with-retries | maxFailures |
| local-cluster | spark.task.maxFailures |
| Cluster Manager | spark.task.maxFailures |

","text":""},{"location":"scheduler/TaskSetManager/#isBarrier","title":"isBarrier","text":"
                                                                                                                                                                                                                                                                          isBarrier: Boolean\n

                                                                                                                                                                                                                                                                          isBarrier is enabled (true) when this TaskSetManager is created for a TaskSet with barrier tasks.

                                                                                                                                                                                                                                                                          isBarrier is used when:

                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet, resourceOffers
                                                                                                                                                                                                                                                                          • TaskSetManager is requested to resourceOffer, checkSpeculatableTasks, getLocalityWait
                                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskSetManager/#resourceOffer","title":"resourceOffer","text":"
                                                                                                                                                                                                                                                                          resourceOffer(\n  execId: String,\n  host: String,\n  maxLocality: TaskLocality.TaskLocality,\n  taskCpus: Int = sched.CPUS_PER_TASK,\n  taskResourceAssignments: Map[String, ResourceInformation] = Map.empty): (Option[TaskDescription], Boolean, Int)\n

Unless the given TaskLocality is NO_PREF, resourceOffer determines the allowed locality level for the offer (never wider than the given maxLocality).

resourceOffer then dequeueTask for the given execId and host at the allowed locality level, which may or may not yield a TaskDescription.

                                                                                                                                                                                                                                                                          In the end, resourceOffer returns the TaskDescription, hasScheduleDelayReject, and the index of the dequeued task (if any).

                                                                                                                                                                                                                                                                          resourceOffer returns a (None, false, -1) tuple when this TaskSetManager is isZombie or the offer (by the given host or execId) should be ignored (excluded).
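A simplified, REPL-style sketch of the decision flow described above; hypothetical helper functions stand in for the internals, and the delay-scheduling reject signalling is left out:

```scala
// Hypothetical stand-in for a task description.
case class TaskDesc(taskId: Long, executorId: String, index: Int)

def resourceOfferSketch(
    isZombie: Boolean,
    offerExcluded: Boolean,                             // host/executor excluded for this TaskSet
    allowedLocalityFor: String => String,               // delay-scheduling adjustment (stand-in)
    dequeueTask: (String, String, String) => Option[TaskDesc],
    execId: String,
    host: String,
    maxLocality: String): (Option[TaskDesc], Boolean, Int) = {
  if (isZombie || offerExcluded) {
    (None, false, -1)                                   // nothing to launch for this offer
  } else {
    // NO_PREF is used as-is; other localities go through the delay-scheduling adjustment.
    val allowed = if (maxLocality == "NO_PREF") maxLocality else allowedLocalityFor(maxLocality)
    dequeueTask(execId, host, allowed) match {
      case Some(task) => (Some(task), false, task.index)
      case None       => (None, false, -1)  // the real code may also signal a schedule-delay reject
    }
  }
}
```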

                                                                                                                                                                                                                                                                          resourceOffer is used when:

                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet
                                                                                                                                                                                                                                                                          "},{"location":"scheduler/TaskSetManager/#locality-wait","title":"Locality Wait
                                                                                                                                                                                                                                                                          getLocalityWait(\n  level: TaskLocality.TaskLocality): Long\n

getLocalityWait is 0 when both the legacyLocalityWaitReset and isBarrier flags are enabled.

                                                                                                                                                                                                                                                                          getLocalityWait determines the value of locality wait based on the given TaskLocality.TaskLocality.

| TaskLocality | Configuration Property |
|--------------|------------------------|
| PROCESS_LOCAL | spark.locality.wait.process |
| NODE_LOCAL | spark.locality.wait.node |
| RACK_LOCAL | spark.locality.wait.rack |

                                                                                                                                                                                                                                                                          Unless the value has been determined, getLocalityWait defaults to 0.

                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                          NO_PREF and ANY task localities have no locality wait.
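The lookup can be sketched like this; a plain Map stands in for SparkConf, the property names are those in the table above, and everything else (including the defaultWaitMs fallback) is an assumption of the sketch:

```scala
// REPL-style sketch of the per-level locality-wait lookup.
def localityWaitSketch(
    level: String,
    conf: Map[String, Long],
    defaultWaitMs: Long,               // assumed fallback (the spark.locality.wait default)
    legacyLocalityWaitReset: Boolean,
    isBarrier: Boolean): Long = {
  if (legacyLocalityWaitReset && isBarrier) return 0L
  val key = level match {
    case "PROCESS_LOCAL" => Some("spark.locality.wait.process")
    case "NODE_LOCAL"    => Some("spark.locality.wait.node")
    case "RACK_LOCAL"    => Some("spark.locality.wait.rack")
    case _               => None       // NO_PREF and ANY have no locality wait
  }
  key.map(k => conf.getOrElse(k, defaultWaitMs)).getOrElse(0L)
}
```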

                                                                                                                                                                                                                                                                          getLocalityWait is used when:

                                                                                                                                                                                                                                                                          • TaskSetManager is created and recomputes locality preferences
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#sparkdrivermaxresultsize","title":"spark.driver.maxResultSize

                                                                                                                                                                                                                                                                          TaskSetManager uses spark.driver.maxResultSize configuration property to check available memory for more task results.

                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#recomputing-task-locality-preferences","title":"Recomputing Task Locality Preferences
                                                                                                                                                                                                                                                                          recomputeLocality(): Unit\n

                                                                                                                                                                                                                                                                          If zombie, recomputeLocality does nothing.

                                                                                                                                                                                                                                                                          recomputeLocality recomputes myLocalityLevels, localityWaits and currentLocalityIndex internal registries.

                                                                                                                                                                                                                                                                          recomputeLocality computes locality levels (for scheduled tasks) and saves the result in myLocalityLevels internal registry.

                                                                                                                                                                                                                                                                          recomputeLocality computes localityWaits by determining the locality wait for every locality level in myLocalityLevels.

recomputeLocality computes currentLocalityIndex using getLocalityIndex with the previous locality level. If the new locality index is higher than the previous one, recomputeLocality recalculates currentLocalityIndex once more (to shift to a newly-available, more local level).
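A REPL-style sketch of the recomputation over plain local variables; the *Sketch helpers are hypothetical stand-ins for the real methods and registries:

```scala
var myLocalityLevels: Array[String] = Array("PROCESS_LOCAL", "NODE_LOCAL", "ANY")
var localityWaits: Array[Long]      = Array(3000L, 3000L, 0L)
var currentLocalityIndex: Int       = 1

// Hypothetical stand-ins for computeValidLocalityLevels, getLocalityWait and getLocalityIndex.
def computeValidLocalityLevelsSketch(): Array[String] = Array("NODE_LOCAL", "ANY")
def getLocalityWaitSketch(level: String): Long = if (level == "ANY") 0L else 3000L
def getLocalityIndexSketch(levels: Array[String], level: String): Int = {
  val i = levels.indexOf(level)
  if (i >= 0) i else levels.length - 1
}

def recomputeLocalitySketch(isZombie: Boolean): Unit = {
  if (isZombie) return                                   // zombies skip the recomputation
  val previousLevel = myLocalityLevels(currentLocalityIndex)
  myLocalityLevels = computeValidLocalityLevelsSketch()  // recompute valid levels
  localityWaits = myLocalityLevels.map(getLocalityWaitSketch)
  currentLocalityIndex = getLocalityIndexSketch(myLocalityLevels, previousLevel)
}
```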

                                                                                                                                                                                                                                                                          recomputeLocality is used when:

                                                                                                                                                                                                                                                                          • TaskSetManager is notified about status change in executors (i.e., lost, decommissioned, added)
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#zombie","title":"Zombie

                                                                                                                                                                                                                                                                          A TaskSetManager is a zombie when all tasks in a taskset have completed successfully (regardless of the number of task attempts), or if the taskset has been aborted.

                                                                                                                                                                                                                                                                          While in zombie state, a TaskSetManager can launch no new tasks and responds with no TaskDescriptions to resourceOffers.

A TaskSetManager remains in the zombie state until all of its tasks have finished running, so that it can continue to track and account for the still-running tasks.

                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#computing-locality-levels-for-scheduled-tasks","title":"Computing Locality Levels (for Scheduled Tasks)
                                                                                                                                                                                                                                                                          computeValidLocalityLevels(): Array[TaskLocality.TaskLocality]\n

                                                                                                                                                                                                                                                                          computeValidLocalityLevels computes valid locality levels for tasks that were registered in corresponding registries per locality level.

                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                          TaskLocality is a locality preference of a task and can be the most localized PROCESS_LOCAL, NODE_LOCAL through NO_PREF and RACK_LOCAL to ANY.

For every pending task (in the pendingTasks registry), computeValidLocalityLevels requests the TaskSchedulerImpl for acceptable TaskLocalities:

                                                                                                                                                                                                                                                                          • For every executor, computeValidLocalityLevels requests the TaskSchedulerImpl to isExecutorAlive and adds PROCESS_LOCAL
                                                                                                                                                                                                                                                                          • For every host, computeValidLocalityLevels requests the TaskSchedulerImpl to hasExecutorsAliveOnHost and adds NODE_LOCAL
                                                                                                                                                                                                                                                                          • For any pending tasks with no locality preference, computeValidLocalityLevels adds NO_PREF
                                                                                                                                                                                                                                                                          • For every rack, computeValidLocalityLevels requests the TaskSchedulerImpl to hasHostAliveOnRack and adds RACK_LOCAL

                                                                                                                                                                                                                                                                          computeValidLocalityLevels always registers ANY task locality level.

                                                                                                                                                                                                                                                                          In the end, computeValidLocalityLevels prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                          Valid locality levels for [taskSet]: [comma-separated levels]\n
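
A minimal Scala sketch of the selection logic above, assuming simplified stand-ins for the pending-task registries and the TaskSchedulerImpl checks (names follow the description; the real implementation is private to TaskSetManager and may differ):

import scala.collection.mutable.{ArrayBuffer, HashMap}

// Hypothetical, simplified stand-ins for the per-locality-level registries
// of pending tasks that TaskSetManager maintains internally.
case class PendingTasks(
  forExecutor: HashMap[String, ArrayBuffer[Int]],
  forHost: HashMap[String, ArrayBuffer[Int]],
  noPrefs: ArrayBuffer[Int],
  forRack: HashMap[String, ArrayBuffer[Int]])

// A level is valid only when there are pending tasks for it and the
// TaskSchedulerImpl still knows matching alive executors, hosts or racks.
def computeValidLocalityLevels(
    pending: PendingTasks,
    isExecutorAlive: String => Boolean,
    hasExecutorsAliveOnHost: String => Boolean,
    hasHostAliveOnRack: String => Boolean): Array[String] = {
  val levels = ArrayBuffer.empty[String]
  if (pending.forExecutor.nonEmpty && pending.forExecutor.keys.exists(isExecutorAlive))
    levels += "PROCESS_LOCAL"
  if (pending.forHost.nonEmpty && pending.forHost.keys.exists(hasExecutorsAliveOnHost))
    levels += "NODE_LOCAL"
  if (pending.noPrefs.nonEmpty)
    levels += "NO_PREF"
  if (pending.forRack.nonEmpty && pending.forRack.keys.exists(hasHostAliveOnRack))
    levels += "RACK_LOCAL"
  levels += "ANY" // ANY is always a valid level
  levels.toArray
}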

                                                                                                                                                                                                                                                                          computeValidLocalityLevels is used when:

• TaskSetManager is created and is requested to recomputeLocality
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#executoradded","title":"executorAdded
                                                                                                                                                                                                                                                                          executorAdded(): Unit\n

executorAdded simply calls recomputeLocality (to recompute the valid locality levels).

                                                                                                                                                                                                                                                                          executorAdded is used when:

                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to handle resource offers
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#prepareLaunchingTask","title":"prepareLaunchingTask
                                                                                                                                                                                                                                                                          prepareLaunchingTask(\n  execId: String,\n  host: String,\n  index: Int,\n  taskLocality: TaskLocality.Value,\n  speculative: Boolean,\n  taskCpus: Int,\n  taskResourceAssignments: Map[String, ResourceInformation],\n  launchTime: Long): TaskDescription\n
                                                                                                                                                                                                                                                                          taskResourceAssignments

                                                                                                                                                                                                                                                                          taskResourceAssignments are the resources that are passed in to resourceOffer.

                                                                                                                                                                                                                                                                          prepareLaunchingTask...FIXME
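
The walkthrough above is still a FIXME, but conceptually prepareLaunchingTask packages the accepted offer into a TaskDescription for the executor backend to launch. A hedged sketch with a hypothetical, simplified TaskDescription (the real class also carries the serialized task binary, added files and jars, and more):

// Hypothetical, simplified stand-in for org.apache.spark.scheduler.TaskDescription.
case class SimpleTaskDescription(
  taskId: Long,
  executorId: String,
  host: String,
  index: Int,
  speculative: Boolean,
  cpus: Int,
  resources: Map[String, Seq[String]],
  launchTime: Long)

// Conceptual flow only: assign a task id and return a launchable description.
def prepareLaunchingTask(
    execId: String,
    host: String,
    index: Int,
    speculative: Boolean,
    taskCpus: Int,
    taskResourceAssignments: Map[String, Seq[String]],
    launchTime: Long)(nextTaskId: () => Long): SimpleTaskDescription =
  SimpleTaskDescription(nextTaskId(), execId, host, index, speculative, taskCpus,
    taskResourceAssignments, launchTime)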

                                                                                                                                                                                                                                                                          prepareLaunchingTask is used when:

                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to resourceOffers
• TaskSetManager is requested to resourceOffer
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#demo","title":"Demo

Enable DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl (or org.apache.spark.scheduler.cluster.YarnScheduler for YARN) and org.apache.spark.scheduler.TaskSetManager, and execute the following two-stage job to see their low-level inner workings.

                                                                                                                                                                                                                                                                          A cluster manager is recommended since it gives more task localization choices (with YARN additionally supporting rack localization).

                                                                                                                                                                                                                                                                          $ ./bin/spark-shell \\\n    --master yarn \\\n    --conf spark.ui.showConsoleProgress=false\n\n// Keep # partitions low to keep # messages low\n\nscala> sc.parallelize(0 to 9, 3).groupBy(_ % 3).count\nINFO YarnScheduler: Adding task set 0.0 with 3 tasks\nDEBUG TaskSetManager: Epoch for TaskSet 0.0: 0\nDEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY\nDEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0\nINFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.0.2.87, executor 1, partition 0, PROCESS_LOCAL, 7541 bytes)\nINFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.0.2.87, executor 2, partition 1, PROCESS_LOCAL, 7541 bytes)\nDEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1\nINFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.0.2.87, executor 1, partition 2, PROCESS_LOCAL, 7598 bytes)\nDEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1\nDEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY\nINFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 518 ms on 10.0.2.87 (executor 1) (1/3)\nINFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 512 ms on 10.0.2.87 (executor 2) (2/3)\nDEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0\nINFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51 ms on 10.0.2.87 (executor 1) (3/3)\nINFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool\nINFO YarnScheduler: Adding task set 1.0 with 3 tasks\nDEBUG TaskSetManager: Epoch for TaskSet 1.0: 1\nDEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY\nDEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0\nINFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 10.0.2.87, executor 2, partition 0, NODE_LOCAL, 7348 bytes)\nINFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 10.0.2.87, executor 1, partition 1, NODE_LOCAL, 7348 bytes)\nDEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1\nINFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 10.0.2.87, executor 1, partition 2, NODE_LOCAL, 7348 bytes)\nINFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 130 ms on 10.0.2.87 (executor 1) (1/3)\nDEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1\nDEBUG TaskSetManager: No tasks for locality level NODE_LOCAL, so moving to locality level RACK_LOCAL\nDEBUG TaskSetManager: No tasks for locality level RACK_LOCAL, so moving to locality level ANY\nINFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 133 ms on 10.0.2.87 (executor 2) (2/3)\nDEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0\nINFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 21 ms on 10.0.2.87 (executor 1) (3/3)\nINFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool\nres0: Long = 3\n
                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskSetManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.scheduler.TaskSetManager logger to see what happens inside.

                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                          log4j.logger.org.apache.spark.scheduler.TaskSetManager=ALL\n

                                                                                                                                                                                                                                                                          Refer to Logging

                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/","title":"Serialization System","text":"

                                                                                                                                                                                                                                                                          Serialization System is a core component of Apache Spark with pluggable serializers for task closures and block data.

                                                                                                                                                                                                                                                                          Serialization System uses SerializerManager to select the Serializer (based on spark.serializer configuration property).
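
For example, Kryo can be plugged in by pointing spark.serializer at KryoSerializer (a standard configuration property; shown here with SparkConf):

import org.apache.spark.{SparkConf, SparkContext}

// Make the Serialization System pick KryoSerializer instead of the default JavaSerializer.
val conf = new SparkConf()
  .setAppName("serializer-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = SparkContext.getOrCreate(conf)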

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/","title":"DeserializationStream","text":"


                                                                                                                                                                                                                                                                          DeserializationStream is an abstraction of streams for reading serialized objects.

                                                                                                                                                                                                                                                                          == [[readObject]] readObject Method

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/#source-scala","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readobjectt-classtag-t","title":"readObjectT: ClassTag: T","text":"

                                                                                                                                                                                                                                                                          readObject...FIXME

                                                                                                                                                                                                                                                                          readObject is used when...FIXME

                                                                                                                                                                                                                                                                          == [[readKey]] readKey Method

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/#source-scala_1","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readkeyt-classtag-t","title":"readKeyT: ClassTag: T","text":"

readKey reads the object (readObject) representing the key of a key-value record.

                                                                                                                                                                                                                                                                          readKey is used when...FIXME

                                                                                                                                                                                                                                                                          == [[readValue]] readValue Method

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/#source-scala_2","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readvaluet-classtag-t","title":"readValueT: ClassTag: T","text":"

readValue reads the object (readObject) representing the value of a key-value record.

                                                                                                                                                                                                                                                                          readValue is used when...FIXME

                                                                                                                                                                                                                                                                          == [[asIterator]] asIterator Method

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/#source-scala_3","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#asiterator-iteratorany","title":"asIterator: Iterator[Any]","text":"

                                                                                                                                                                                                                                                                          asIterator...FIXME

                                                                                                                                                                                                                                                                          asIterator is used when...FIXME

                                                                                                                                                                                                                                                                          == [[asKeyValueIterator]] asKeyValueIterator Method

                                                                                                                                                                                                                                                                          "},{"location":"serializer/DeserializationStream/#source-scala_4","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#askeyvalueiterator-iteratorany","title":"asKeyValueIterator: Iterator[Any]","text":"

                                                                                                                                                                                                                                                                          asKeyValueIterator...FIXME

                                                                                                                                                                                                                                                                          asKeyValueIterator is used when...FIXME
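
The sections above are still to be filled in, but a DeserializationStream is typically obtained from a SerializerInstance and drained through one of its iterators. A sketch (assuming the bytes were written as key-value records with a matching SerializationStream):

import java.io.ByteArrayInputStream
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

def readAllPairs(bytes: Array[Byte]): List[(Any, Any)] = {
  val instance = new KryoSerializer(new SparkConf()).newInstance()
  // deserializeStream gives a DeserializationStream over the raw bytes
  val in = instance.deserializeStream(new ByteArrayInputStream(bytes))
  try {
    // asKeyValueIterator pairs up readKey/readValue calls until the stream is exhausted
    in.asKeyValueIterator.map { case (k, v) => (k, v) }.toList
  } finally {
    in.close()
  }
}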

                                                                                                                                                                                                                                                                          "},{"location":"serializer/JavaSerializerInstance/","title":"JavaSerializerInstance","text":"

                                                                                                                                                                                                                                                                          JavaSerializerInstance is...FIXME

                                                                                                                                                                                                                                                                          "},{"location":"serializer/KryoSerializer/","title":"KryoSerializer","text":"

                                                                                                                                                                                                                                                                          KryoSerializer is a Serializer that uses the Kryo serialization library.

                                                                                                                                                                                                                                                                          "},{"location":"serializer/KryoSerializer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                          KryoSerializer takes the following to be created:

                                                                                                                                                                                                                                                                          • SparkConf

                                                                                                                                                                                                                                                                            KryoSerializer is created\u00a0when:

                                                                                                                                                                                                                                                                            • SerializerManager is created
                                                                                                                                                                                                                                                                            • SparkConf is requested to registerKryoClasses
                                                                                                                                                                                                                                                                            • SerializerSupport (Spark SQL) is requested for a SerializerInstance
                                                                                                                                                                                                                                                                            "},{"location":"serializer/KryoSerializer/#useunsafe-flag","title":"useUnsafe Flag

                                                                                                                                                                                                                                                                            KryoSerializer uses the spark.kryo.unsafe configuration property for useUnsafe flag (initialized when KryoSerializer is created).
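
For example (spark.kryo.unsafe defaults to false):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Turn on the unsafe-based Kryo input/output implementations.
val conf = new SparkConf().set("spark.kryo.unsafe", "true")
val kryoSerializer = new KryoSerializer(conf)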

                                                                                                                                                                                                                                                                            useUnsafe\u00a0is used when KryoSerializer is requested to create the following:

                                                                                                                                                                                                                                                                            • KryoSerializerInstance
                                                                                                                                                                                                                                                                            • KryoOutput
                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#creating-new-serializerinstance","title":"Creating New SerializerInstance
                                                                                                                                                                                                                                                                            newInstance(): SerializerInstance\n

                                                                                                                                                                                                                                                                            newInstance\u00a0is part of the Serializer abstraction.

                                                                                                                                                                                                                                                                            newInstance creates a KryoSerializerInstance with this KryoSerializer (and the useUnsafe and usePool flags).

                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#newkryooutput","title":"newKryoOutput
                                                                                                                                                                                                                                                                            newKryoOutput(): KryoOutput\n

                                                                                                                                                                                                                                                                            newKryoOutput...FIXME

                                                                                                                                                                                                                                                                            newKryoOutput\u00a0is used when:

                                                                                                                                                                                                                                                                            • KryoSerializerInstance is requested for the output
                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#newkryo","title":"newKryo
                                                                                                                                                                                                                                                                            newKryo(): Kryo\n

                                                                                                                                                                                                                                                                            newKryo...FIXME

                                                                                                                                                                                                                                                                            newKryo\u00a0is used when:

                                                                                                                                                                                                                                                                            • KryoSerializer is requested for a KryoFactory
                                                                                                                                                                                                                                                                            • KryoSerializerInstance is requested to borrowKryo
                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#kryofactory","title":"KryoFactory
                                                                                                                                                                                                                                                                            factory: KryoFactory\n

                                                                                                                                                                                                                                                                            KryoSerializer creates a KryoFactory lazily (on demand and once only) for internalPool.

                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#kryopool","title":"KryoPool

                                                                                                                                                                                                                                                                            KryoSerializer creates a custom KryoPool lazily (on demand and once only).

                                                                                                                                                                                                                                                                            KryoPool is used when:

                                                                                                                                                                                                                                                                            • pool
                                                                                                                                                                                                                                                                            • setDefaultClassLoader
                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializer/#supportsrelocationofserializedobjects","title":"supportsRelocationOfSerializedObjects
                                                                                                                                                                                                                                                                            supportsRelocationOfSerializedObjects: Boolean\n

                                                                                                                                                                                                                                                                            supportsRelocationOfSerializedObjects\u00a0is part of the Serializer abstraction.

supportsRelocationOfSerializedObjects creates a new SerializerInstance (that is assumed to be a KryoSerializerInstance) and requests it for the value of Kryo's autoReset field (getAutoReset).
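
Since KryoSerializerInstance and getAutoReset are internal to Spark, the delegation can only be sketched with hypothetical stand-in types:

// Hypothetical minimal interfaces, only to illustrate the delegation described above.
trait MiniSerializerInstance {
  def getAutoReset(): Boolean
}

trait MiniKryoSerializer {
  def newInstance(): MiniSerializerInstance

  // Relocating serialized objects within a byte stream is only safe when Kryo
  // auto-resets its state between objects (no references across object boundaries).
  def supportsRelocationOfSerializedObjects: Boolean = newInstance().getAutoReset()
}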

                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/KryoSerializerInstance/","title":"KryoSerializerInstance","text":"

                                                                                                                                                                                                                                                                            KryoSerializerInstance is a SerializerInstance.

                                                                                                                                                                                                                                                                            "},{"location":"serializer/KryoSerializerInstance/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                            KryoSerializerInstance takes the following to be created:

                                                                                                                                                                                                                                                                            • KryoSerializer
                                                                                                                                                                                                                                                                            • useUnsafe flag
                                                                                                                                                                                                                                                                            • usePool flag

                                                                                                                                                                                                                                                                              KryoSerializerInstance is created\u00a0when:

                                                                                                                                                                                                                                                                              • KryoSerializer is requested for a new SerializerInstance
                                                                                                                                                                                                                                                                              "},{"location":"serializer/KryoSerializerInstance/#output","title":"Output

KryoSerializerInstance creates Kryo's Output lazily (on demand and once only) by requesting the KryoSerializer for a newKryoOutput.

                                                                                                                                                                                                                                                                              output\u00a0is used for serialization.
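
The lazy, once-only semantics can be illustrated with a plain Scala lazy val (a sketch; the factory is passed in here because the real newKryoOutput method is internal to KryoSerializer):

import com.esotericsoftware.kryo.io.Output

// Hypothetical holder: the Output buffer is created on first access only and then reused.
class LazyOutputHolder(newKryoOutput: () => Output) {
  lazy val output: Output = newKryoOutput()
}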

                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/KryoSerializerInstance/#serialize","title":"serialize
                                                                                                                                                                                                                                                                              serialize[T: ClassTag](\n  t: T): ByteBuffer\n

                                                                                                                                                                                                                                                                              serialize\u00a0is part of the SerializerInstance abstraction.

                                                                                                                                                                                                                                                                              serialize...FIXME

                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/KryoSerializerInstance/#deserialize","title":"deserialize
                                                                                                                                                                                                                                                                              deserialize[T: ClassTag](\n  bytes: ByteBuffer): T\n

                                                                                                                                                                                                                                                                              deserialize\u00a0is part of the SerializerInstance abstraction.

                                                                                                                                                                                                                                                                              deserialize...FIXME
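
A round-trip example through the public Serializer API (the SerializerInstance below is a KryoSerializerInstance under the covers):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

val instance = new KryoSerializer(new SparkConf()).newInstance()

// serialize returns a java.nio.ByteBuffer; deserialize reads the value back.
val buffer = instance.serialize(("answer", 42))
val (key, value) = instance.deserialize[(String, Int)](buffer)
assert(key == "answer" && value == 42)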

                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/KryoSerializerInstance/#releasing-kryo-instance","title":"Releasing Kryo Instance
                                                                                                                                                                                                                                                                              releaseKryo(\n  kryo: Kryo): Unit\n

                                                                                                                                                                                                                                                                              releaseKryo...FIXME

                                                                                                                                                                                                                                                                              releaseKryo\u00a0is used when:

                                                                                                                                                                                                                                                                              • KryoSerializationStream is requested to close
                                                                                                                                                                                                                                                                              • KryoDeserializationStream is requested to close
                                                                                                                                                                                                                                                                              • KryoSerializerInstance is requested to serialize and deserialize (and getAutoReset)
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/KryoSerializerInstance/#getautoreset","title":"getAutoReset
                                                                                                                                                                                                                                                                              getAutoReset(): Boolean\n

                                                                                                                                                                                                                                                                              getAutoReset uses Java Reflection to access the value of the autoReset field of the Kryo class.
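
A sketch of that reflective access against a com.esotericsoftware.kryo.Kryo instance (the field name autoReset comes from the Kryo class, which exposes no public getter for it):

import com.esotericsoftware.kryo.Kryo

// Kryo has no public getter for autoReset, hence the reflective read.
def readAutoReset(kryo: Kryo): Boolean = {
  val field = classOf[Kryo].getDeclaredField("autoReset")
  field.setAccessible(true)
  field.getBoolean(kryo)
}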

                                                                                                                                                                                                                                                                              getAutoReset\u00a0is used when:

                                                                                                                                                                                                                                                                              • KryoSerializer is requested for the supportsRelocationOfSerializedObjects flag
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/","title":"SerializationStream","text":"

SerializationStream is an abstraction of streams for writing out serialized objects and key-value records.

                                                                                                                                                                                                                                                                              "},{"location":"serializer/SerializationStream/#contract","title":"Contract","text":""},{"location":"serializer/SerializationStream/#closing-stream","title":"Closing Stream
                                                                                                                                                                                                                                                                              close(): Unit\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/#flushing-stream","title":"Flushing Stream
                                                                                                                                                                                                                                                                              flush(): Unit\n

                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                              • UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
                                                                                                                                                                                                                                                                              • DiskBlockObjectWriter is requested to commitAndGet
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/#writing-out-object","title":"Writing Out Object
                                                                                                                                                                                                                                                                              writeObject[T: ClassTag](\n  t: T): SerializationStream\n

                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                              • MemoryStore is requested to putIteratorAsBytes
                                                                                                                                                                                                                                                                              • JavaSerializerInstance is requested to serialize
                                                                                                                                                                                                                                                                              • RequestMessage is requested to serialize (for NettyRpcEnv)
                                                                                                                                                                                                                                                                              • ParallelCollectionPartition is requested to writeObject (for ParallelCollectionRDD)
                                                                                                                                                                                                                                                                              • ReliableRDDCheckpointData is requested to doCheckpoint
                                                                                                                                                                                                                                                                              • TorrentBroadcast is created (and requested to writeBlocks)
                                                                                                                                                                                                                                                                              • RangePartitioner is requested to writeObject
                                                                                                                                                                                                                                                                              • SerializationStream is requested to writeKey, writeValue or writeAll
                                                                                                                                                                                                                                                                              • FileSystemPersistenceEngine is requested to serializeIntoFile (for Spark Standalone's Master)
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                              • JavaSerializationStream
                                                                                                                                                                                                                                                                              • KryoSerializationStream
                                                                                                                                                                                                                                                                              "},{"location":"serializer/SerializationStream/#writing-out-all-records","title":"Writing Out All Records
                                                                                                                                                                                                                                                                              writeAll[T: ClassTag](\n  iter: Iterator[T]): SerializationStream\n

writeAll writes out all the records of the given iterator, one object at a time (using writeObject), and returns this SerializationStream for call chaining.
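
Conceptually, writeAll is a loop over writeObject. A sketch against a hypothetical minimal stream trait (the real SerializationStream contract has more members):

import scala.reflect.ClassTag

trait MiniSerializationStream {
  def writeObject[T: ClassTag](t: T): MiniSerializationStream

  // Write every record of the iterator out as an object and return this stream for chaining.
  def writeAll[T: ClassTag](iter: Iterator[T]): MiniSerializationStream = {
    while (iter.hasNext) {
      writeObject(iter.next())
    }
    this
  }
}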

                                                                                                                                                                                                                                                                              writeAll is used when:

                                                                                                                                                                                                                                                                              • ReliableCheckpointRDD is requested to doCheckpoint
                                                                                                                                                                                                                                                                              • SerializerManager is requested to dataSerializeStream and dataSerializeWithExplicitClassTag
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/#writing-out-key","title":"Writing Out Key
                                                                                                                                                                                                                                                                              writeKey[T: ClassTag](\n  key: T): SerializationStream\n

Writes out the key of a key-value record

                                                                                                                                                                                                                                                                              writeKey is used when:

                                                                                                                                                                                                                                                                              • UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
                                                                                                                                                                                                                                                                              • DiskBlockObjectWriter is requested to write the key and value of a record
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializationStream/#writing-out-value","title":"Writing Out Value
                                                                                                                                                                                                                                                                              writeValue[T: ClassTag](\n  value: T): SerializationStream\n

                                                                                                                                                                                                                                                                              Writes out the value

                                                                                                                                                                                                                                                                              writeValue is used when:

                                                                                                                                                                                                                                                                              • UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
                                                                                                                                                                                                                                                                              • DiskBlockObjectWriter is requested to write the key and value of a record
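
A short sketch of writing a single key-value record with writeKey and writeValue (again over a JavaSerializer-backed stream chosen just for the example):

import java.io.ByteArrayOutputStream\nimport org.apache.spark.SparkConf\nimport org.apache.spark.serializer.JavaSerializer\n\nval stream = new JavaSerializer(new SparkConf())\n  .newInstance()\n  .serializeStream(new ByteArrayOutputStream())\n\n// one record = one key followed by one value\nstream.writeKey(1).writeValue(42)\nstream.close()\n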
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/Serializer/","title":"Serializer","text":"

                                                                                                                                                                                                                                                                              Serializer is an abstraction of serializers for serialization and deserialization of tasks (closures) and data blocks in a Spark application.

                                                                                                                                                                                                                                                                              "},{"location":"serializer/Serializer/#contract","title":"Contract","text":""},{"location":"serializer/Serializer/#creating-new-serializerinstance","title":"Creating New SerializerInstance
                                                                                                                                                                                                                                                                              newInstance(): SerializerInstance\n

                                                                                                                                                                                                                                                                              Creates a new SerializerInstance

                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                              • Task is created (only used in tests)
                                                                                                                                                                                                                                                                              • SerializerSupport (Spark SQL) utility is used to newSerializer
                                                                                                                                                                                                                                                                              • RangePartitioner is requested to writeObject and readObject
                                                                                                                                                                                                                                                                              • TorrentBroadcast utility is used to blockifyObject and unBlockifyObject
                                                                                                                                                                                                                                                                              • TaskRunner is requested to run
                                                                                                                                                                                                                                                                              • NettyBlockRpcServer is requested to deserializeMetadata
                                                                                                                                                                                                                                                                              • NettyBlockTransferService is requested to uploadBlock
                                                                                                                                                                                                                                                                              • PairRDDFunctions is requested to...FIXME
                                                                                                                                                                                                                                                                              • ParallelCollectionPartition is requested to...FIXME
                                                                                                                                                                                                                                                                              • RDD is requested to...FIXME
                                                                                                                                                                                                                                                                              • ReliableCheckpointRDD utility is used to...FIXME
                                                                                                                                                                                                                                                                              • NettyRpcEnvFactory is requested to create a RpcEnv
                                                                                                                                                                                                                                                                              • DAGScheduler is created
                                                                                                                                                                                                                                                                              • others
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/Serializer/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                              • JavaSerializer
                                                                                                                                                                                                                                                                              • KryoSerializer
                                                                                                                                                                                                                                                                              • UnsafeRowSerializer (Spark SQL)
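
As a sketch, a serialization roundtrip with one of the implementations listed above (KryoSerializer) via the newInstance contract:

import org.apache.spark.SparkConf\nimport org.apache.spark.serializer.KryoSerializer\n\nval serializer = new KryoSerializer(new SparkConf())\nval instance = serializer.newInstance()\n\n// serialize to a java.nio.ByteBuffer and back\nval bytes = instance.serialize(Seq(1, 2, 3))\nval restored = instance.deserialize[Seq[Int]](bytes) // List(1, 2, 3)\n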
                                                                                                                                                                                                                                                                              "},{"location":"serializer/Serializer/#accessing-serializer","title":"Accessing Serializer","text":"

                                                                                                                                                                                                                                                                              Serializer is available using SparkEnv as the closureSerializer and serializer.

                                                                                                                                                                                                                                                                              "},{"location":"serializer/Serializer/#closureserializer","title":"closureSerializer
                                                                                                                                                                                                                                                                              SparkEnv.get.closureSerializer\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/Serializer/#serializer_1","title":"serializer
                                                                                                                                                                                                                                                                              SparkEnv.get.serializer\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/Serializer/#serialized-objects-relocation-requirements","title":"Serialized Objects Relocation Requirements
                                                                                                                                                                                                                                                                              supportsRelocationOfSerializedObjects: Boolean\n

                                                                                                                                                                                                                                                                              supportsRelocationOfSerializedObjects is disabled (false) by default.

                                                                                                                                                                                                                                                                              supportsRelocationOfSerializedObjects is used when:

                                                                                                                                                                                                                                                                              • BlockStoreShuffleReader is requested to fetchContinuousBlocksInBatch
                                                                                                                                                                                                                                                                              • SortShuffleManager is requested to create a ShuffleHandle for a given ShuffleDependency (and checks out SerializedShuffleHandle requirements)
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializerInstance/","title":"SerializerInstance","text":"

                                                                                                                                                                                                                                                                              SerializerInstance is an abstraction of serializer instances (for use by one thread at a time).

                                                                                                                                                                                                                                                                              "},{"location":"serializer/SerializerInstance/#contract","title":"Contract","text":""},{"location":"serializer/SerializerInstance/#deserializing-from-bytebuffer","title":"Deserializing (from ByteBuffer)
                                                                                                                                                                                                                                                                              deserialize[T: ClassTag](\n  bytes: ByteBuffer): T\ndeserialize[T: ClassTag](\n  bytes: ByteBuffer,\n  loader: ClassLoader): T\n

                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                              • TaskRunner is requested to run
                                                                                                                                                                                                                                                                              • ResultTask is requested to run
                                                                                                                                                                                                                                                                              • ShuffleMapTask is requested to run
                                                                                                                                                                                                                                                                              • TaskResultGetter is requested to enqueueFailedTask
                                                                                                                                                                                                                                                                              • others
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializerInstance/#deserializing-from-inputstream","title":"Deserializing (from InputStream)
                                                                                                                                                                                                                                                                              deserializeStream(\n  s: InputStream): DeserializationStream\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializerInstance/#serializing-to-bytebuffer","title":"Serializing (to ByteBuffer)
                                                                                                                                                                                                                                                                              serialize[T: ClassTag](\n  t: T): ByteBuffer\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializerInstance/#serializing-to-outputstream","title":"Serializing (to OutputStream)
                                                                                                                                                                                                                                                                              serializeStream(\n  s: OutputStream): SerializationStream\n
                                                                                                                                                                                                                                                                              ","text":""},{"location":"serializer/SerializerInstance/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                              • JavaSerializerInstance
                                                                                                                                                                                                                                                                              • KryoSerializerInstance
                                                                                                                                                                                                                                                                              • UnsafeRowSerializerInstance (Spark SQL)
                                                                                                                                                                                                                                                                              "},{"location":"serializer/SerializerManager/","title":"SerializerManager","text":"

                                                                                                                                                                                                                                                                              SerializerManager is used to select the Serializer for shuffle blocks.

                                                                                                                                                                                                                                                                              "},{"location":"serializer/SerializerManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                              SerializerManager takes the following to be created:

                                                                                                                                                                                                                                                                              • Default Serializer
                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                              • (optional) Encryption Key (Option[Array[Byte]])

                                                                                                                                                                                                                                                                                SerializerManager is created\u00a0when:

                                                                                                                                                                                                                                                                                • SparkEnv utility is used to create a SparkEnv (for the driver and executors)
                                                                                                                                                                                                                                                                                "},{"location":"serializer/SerializerManager/#kryo-compatible-types","title":"Kryo-Compatible Types

Kryo-Compatible Types are the following primitive types, arrays of these primitive types, and Strings:

                                                                                                                                                                                                                                                                                • Boolean
                                                                                                                                                                                                                                                                                • Byte
                                                                                                                                                                                                                                                                                • Char
                                                                                                                                                                                                                                                                                • Double
                                                                                                                                                                                                                                                                                • Float
                                                                                                                                                                                                                                                                                • Int
                                                                                                                                                                                                                                                                                • Long
                                                                                                                                                                                                                                                                                • Null
                                                                                                                                                                                                                                                                                • Short
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#default-serializer","title":"Default Serializer

                                                                                                                                                                                                                                                                                SerializerManager is given a Serializer when created (based on spark.serializer configuration property).

                                                                                                                                                                                                                                                                                The Serializer is used when SerializerManager is requested for a Serializer.
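
For example, Kryo can be made the default Serializer with the spark.serializer configuration property (a SparkConf sketch; spark-submit with --conf works the same way):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n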

                                                                                                                                                                                                                                                                                Tip

                                                                                                                                                                                                                                                                                Enable DEBUG logging level of SparkEnv to be told about the selected Serializer.

                                                                                                                                                                                                                                                                                Using serializer: [serializer]\n
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#accessing-serializermanager","title":"Accessing SerializerManager

                                                                                                                                                                                                                                                                                SerializerManager is available using SparkEnv on the driver and executors.

                                                                                                                                                                                                                                                                                import org.apache.spark.SparkEnv\nSparkEnv.get.serializerManager\n
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#kryoserializer","title":"KryoSerializer

                                                                                                                                                                                                                                                                                SerializerManager creates a KryoSerializer when created.

                                                                                                                                                                                                                                                                                KryoSerializer is used as the serializer when the types of a given key and value are Kryo-compatible.

                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#selecting-serializer","title":"Selecting Serializer
                                                                                                                                                                                                                                                                                getSerializer(\n  ct: ClassTag[_],\n  autoPick: Boolean): Serializer\ngetSerializer(\n  keyClassTag: ClassTag[_],\n  valueClassTag: ClassTag[_]): Serializer\n

                                                                                                                                                                                                                                                                                getSerializer returns the KryoSerializer when the given ClassTags are Kryo-compatible and the autoPick flag is true. Otherwise, getSerializer returns the default Serializer.

The autoPick flag is true for all BlockIds except Spark Streaming's StreamBlockIds. A sketch of this selection logic follows the usage lists below.

                                                                                                                                                                                                                                                                                getSerializer (with autoPick flag) is used when:

                                                                                                                                                                                                                                                                                • SerializerManager is requested to dataSerializeStream, dataSerializeWithExplicitClassTag and dataDeserializeStream
                                                                                                                                                                                                                                                                                • SerializedValuesHolder (of MemoryStore) is requested for a SerializationStream

                                                                                                                                                                                                                                                                                getSerializer (with key and value ClassTags only) is used when:

                                                                                                                                                                                                                                                                                • ShuffledRDD is requested for dependencies
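
A minimal sketch of the selection logic; canUseKryo, kryoCompatibleTags and the JavaSerializer standing in for the default Serializer are assumptions modeled on the description above, not SerializerManager's actual internals:

import scala.reflect.ClassTag\nimport org.apache.spark.SparkConf\nimport org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}\n\n// the Kryo-compatible ClassTags (a subset of the list above, for brevity)\nval kryoCompatibleTags: Set[ClassTag[_]] = Set(\n  ClassTag.Boolean, ClassTag.Byte, ClassTag.Char, ClassTag.Double, ClassTag.Float,\n  ClassTag.Int, ClassTag.Long, ClassTag.Short,\n  ClassTag(classOf[String]), ClassTag(classOf[Array[Int]]), ClassTag(classOf[Array[Byte]]))\n\nval conf = new SparkConf()\nval defaultSerializer: Serializer = new JavaSerializer(conf) // stands in for the default Serializer\nval kryoSerializer = new KryoSerializer(conf)\n\ndef canUseKryo(ct: ClassTag[_]): Boolean = kryoCompatibleTags.contains(ct)\n\n// the selection rule: Kryo for compatible tags with autoPick enabled, the default Serializer otherwise\ndef getSerializer(ct: ClassTag[_], autoPick: Boolean): Serializer =\n  if (autoPick && canUseKryo(ct)) kryoSerializer else defaultSerializer\n\ngetSerializer(ClassTag.Int, autoPick = true)              // KryoSerializer\ngetSerializer(ClassTag(classOf[Object]), autoPick = true) // default Serializer\ngetSerializer(ClassTag.Int, autoPick = false)             // default Serializer\n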
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#dataserializestream","title":"dataSerializeStream
                                                                                                                                                                                                                                                                                dataSerializeStream[T: ClassTag](\n  blockId: BlockId,\n  outputStream: OutputStream,\n  values: Iterator[T]): Unit\n

                                                                                                                                                                                                                                                                                dataSerializeStream...FIXME

                                                                                                                                                                                                                                                                                dataSerializeStream\u00a0is used when:

                                                                                                                                                                                                                                                                                • BlockManager is requested to doPutIterator and dropFromMemory
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#dataserializewithexplicitclasstag","title":"dataSerializeWithExplicitClassTag
                                                                                                                                                                                                                                                                                dataSerializeWithExplicitClassTag(\n  blockId: BlockId,\n  values: Iterator[_],\n  classTag: ClassTag[_]): ChunkedByteBuffer\n

                                                                                                                                                                                                                                                                                dataSerializeWithExplicitClassTag...FIXME

                                                                                                                                                                                                                                                                                dataSerializeWithExplicitClassTag\u00a0is used when:

                                                                                                                                                                                                                                                                                • BlockManager is requested to doGetLocalBytes
                                                                                                                                                                                                                                                                                • SerializerManager is requested to dataSerialize
                                                                                                                                                                                                                                                                                ","text":""},{"location":"serializer/SerializerManager/#datadeserializestream","title":"dataDeserializeStream
                                                                                                                                                                                                                                                                                dataDeserializeStream[T](\n  blockId: BlockId,\n  inputStream: InputStream)\n  (classTag: ClassTag[T]): Iterator[T]\n

                                                                                                                                                                                                                                                                                dataDeserializeStream...FIXME

                                                                                                                                                                                                                                                                                dataDeserializeStream\u00a0is used when:

                                                                                                                                                                                                                                                                                • BlockStoreUpdater is requested to saveDeserializedValuesToMemoryStore
                                                                                                                                                                                                                                                                                • BlockManager is requested to getLocalValues and getRemoteValues
                                                                                                                                                                                                                                                                                • MemoryStore is requested to putIteratorAsBytes (when PartiallySerializedBlock is requested for a PartiallyUnrolledIterator)
                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/","title":"Shuffle System","text":"

                                                                                                                                                                                                                                                                                Shuffle System is a core service of Apache Spark that is responsible for shuffle blocks.

The core abstraction is ShuffleManager, with SortShuffleManager as the default and only known implementation.

                                                                                                                                                                                                                                                                                spark.shuffle.manager configuration property allows for a custom ShuffleManager.

                                                                                                                                                                                                                                                                                Shuffle System uses shuffle handles, readers and writers.
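
For example, a SparkConf sketch that spells out the default SortShuffleManager by its fully-qualified class name (a custom ShuffleManager would be wired in the same way):

import org.apache.spark.SparkConf\n\n// points spark.shuffle.manager at a ShuffleManager implementation\nval conf = new SparkConf()\n  .set(\"spark.shuffle.manager\", \"org.apache.spark.shuffle.sort.SortShuffleManager\")\n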

                                                                                                                                                                                                                                                                                "},{"location":"shuffle/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                                                • Improving Apache Spark Downscaling by Christopher Crosbie (Google) Ben Sidhom (Google)
                                                                                                                                                                                                                                                                                • Spark shuffle introduction by Raymond Liu (aka colorant)
                                                                                                                                                                                                                                                                                "},{"location":"shuffle/BaseShuffleHandle/","title":"BaseShuffleHandle","text":"

                                                                                                                                                                                                                                                                                BaseShuffleHandle is a ShuffleHandle that is used to capture the parameters when SortShuffleManager is requested for a ShuffleHandle (and the other specialized ShuffleHandles could not be selected):

                                                                                                                                                                                                                                                                                • Shuffle ID
                                                                                                                                                                                                                                                                                • ShuffleDependency"},{"location":"shuffle/BaseShuffleHandle/#extensions","title":"Extensions","text":"
                                                                                                                                                                                                                                                                                  • BypassMergeSortShuffleHandle
                                                                                                                                                                                                                                                                                  • SerializedShuffleHandle
                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BaseShuffleHandle/#demo","title":"Demo","text":"
                                                                                                                                                                                                                                                                                  // Start a Spark application, e.g. spark-shell, with the Spark properties to trigger selection of BaseShuffleHandle:\n// 1. spark.shuffle.spill.numElementsForceSpillThreshold=1\n// 2. spark.shuffle.sort.bypassMergeThreshold=1\n\n// numSlices > spark.shuffle.sort.bypassMergeThreshold\nscala> val rdd = sc.parallelize(0 to 4, numSlices = 2).groupBy(_ % 2)\nrdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:24\n\nscala> rdd.dependencies\nDEBUG SortShuffleManager: Can't use serialized shuffle for shuffle 0 because an aggregator is defined\nres0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@1160c54b)\n\nscala> rdd.getNumPartitions\nres1: Int = 2\n\nscala> import org.apache.spark.ShuffleDependency\nimport org.apache.spark.ShuffleDependency\n\nscala> val shuffleDep = rdd.dependencies(0).asInstanceOf[ShuffleDependency[Int, Int, Int]]\nshuffleDep: org.apache.spark.ShuffleDependency[Int,Int,Int] = org.apache.spark.ShuffleDependency@1160c54b\n\n// mapSideCombine is disabled\nscala> shuffleDep.mapSideCombine\nres2: Boolean = false\n\n// aggregator defined\nscala> shuffleDep.aggregator\nres3: Option[org.apache.spark.Aggregator[Int,Int,Int]] = Some(Aggregator(<function1>,<function2>,<function2>))\n\n// the number of reduce partitions < spark.shuffle.sort.bypassMergeThreshold\nscala> shuffleDep.partitioner.numPartitions\nres4: Int = 2\n\nscala> shuffleDep.shuffleHandle\nres5: org.apache.spark.shuffle.ShuffleHandle = org.apache.spark.shuffle.BaseShuffleHandle@22b0fe7e\n
                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BlockStoreShuffleReader/","title":"BlockStoreShuffleReader","text":"

                                                                                                                                                                                                                                                                                  BlockStoreShuffleReader is a ShuffleReader.

                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BlockStoreShuffleReader/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                  BlockStoreShuffleReader takes the following to be created:

                                                                                                                                                                                                                                                                                  • BaseShuffleHandle
                                                                                                                                                                                                                                                                                  • Blocks by Address (Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])])
                                                                                                                                                                                                                                                                                  • TaskContext
                                                                                                                                                                                                                                                                                  • ShuffleReadMetricsReporter
                                                                                                                                                                                                                                                                                  • SerializerManager
                                                                                                                                                                                                                                                                                  • BlockManager
                                                                                                                                                                                                                                                                                  • MapOutputTracker
                                                                                                                                                                                                                                                                                  • shouldBatchFetch flag (default: false)

                                                                                                                                                                                                                                                                                    BlockStoreShuffleReader is created\u00a0when:

                                                                                                                                                                                                                                                                                    • SortShuffleManager is requested for a ShuffleReader (for a ShuffleHandle and a range of reduce partitions)
                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/BlockStoreShuffleReader/#reading-combined-records-for-reduce-task","title":"Reading Combined Records (for Reduce Task)
                                                                                                                                                                                                                                                                                    read(): Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                    read\u00a0is part of the ShuffleReader abstraction.

                                                                                                                                                                                                                                                                                    read creates a ShuffleBlockFetcherIterator.

                                                                                                                                                                                                                                                                                    read...FIXME

                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BlockStoreShuffleReader/#fetchcontinuousblocksinbatch","title":"fetchContinuousBlocksInBatch
                                                                                                                                                                                                                                                                                    fetchContinuousBlocksInBatch: Boolean\n

                                                                                                                                                                                                                                                                                    fetchContinuousBlocksInBatch...FIXME

                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BlockStoreShuffleReader/#review-me","title":"Review Me

Reading Combined Records For Reduce Task

Internally, read first creates a ShuffleBlockFetcherIterator (passing in the values of <>, <> and <> Spark properties).

                                                                                                                                                                                                                                                                                    NOTE: read uses scheduler:MapOutputTracker.md#getMapSizesByExecutorId[MapOutputTracker to find the BlockManagers with the shuffle blocks and sizes] to create ShuffleBlockFetcherIterator.

                                                                                                                                                                                                                                                                                    read creates a new serializer:SerializerInstance.md[SerializerInstance] (using Serializer from ShuffleDependency).

read creates a key/value iterator by deserializing every shuffle block stream (using deserializeStream).

                                                                                                                                                                                                                                                                                    read updates the context task metrics for each record read.

                                                                                                                                                                                                                                                                                    NOTE: read uses CompletionIterator (to count the records read) and spark-InterruptibleIterator.md[InterruptibleIterator] (to support task cancellation).

                                                                                                                                                                                                                                                                                    If the ShuffleDependency has an Aggregator defined, read wraps the current iterator inside an iterator defined by Aggregator.combineCombinersByKey (for mapSideCombine enabled) or Aggregator.combineValuesByKey otherwise.

NOTE: read reports an exception when the ShuffleDependency has no Aggregator defined but the mapSideCombine flag is enabled.

For a keyOrdering defined in the ShuffleDependency, read does the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                    1. shuffle:ExternalSorter.md#creating-instance[Creates an ExternalSorter]
                                                                                                                                                                                                                                                                                    2. shuffle:ExternalSorter.md#insertAll[Inserts all the records] into the ExternalSorter
                                                                                                                                                                                                                                                                                    3. Updates context TaskMetrics
                                                                                                                                                                                                                                                                                    4. Returns a CompletionIterator for the ExternalSorter
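The following plain-Scala sketch only illustrates the shape of read described above; it is not Spark's implementation (Spark combines with an Aggregator and sorts with a disk-spilling ExternalSorter, while this sketch stays in memory), and readSketch and its parameters are hypothetical names.

import scala.collection.mutable

// Illustration only: combine fetched (key, value) pairs per key when an
// aggregator is defined, then sort by key when a key ordering is defined.
def readSketch[K, V, C](
    fetched: Iterator[(K, V)],
    aggregator: Option[(V => C, (C, V) => C)], // (createCombiner, mergeValue)
    keyOrdering: Option[Ordering[K]]): Iterator[(K, C)] = {
  val combined: Iterator[(K, C)] = aggregator match {
    case Some((createCombiner, mergeValue)) =>
      val combiners = mutable.LinkedHashMap.empty[K, C]
      fetched.foreach { case (k, v) =>
        combiners.update(k, combiners.get(k).fold(createCombiner(v))(mergeValue(_, v)))
      }
      combiners.iterator
    case None =>
      fetched.asInstanceOf[Iterator[(K, C)]] // no aggregation: values are already combiners
  }
  keyOrdering.fold(combined)(ord => combined.toSeq.sortBy(_._1)(ord).iterator)
}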
                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleHandle/","title":"BypassMergeSortShuffleHandle","text":"

BypassMergeSortShuffleHandle is a BaseShuffleHandle that SortShuffleManager uses when it can avoid merge-sorting data (when requested to register a shuffle).

BypassMergeSortShuffleHandle tells SortShuffleManager to use BypassMergeSortShuffleWriter when requested for a ShuffleWriter.

                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/BypassMergeSortShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                    BypassMergeSortShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                    • Shuffle ID
                                                                                                                                                                                                                                                                                    • ShuffleDependency

                                                                                                                                                                                                                                                                                      BypassMergeSortShuffleHandle is created when:

                                                                                                                                                                                                                                                                                      • SortShuffleManager is requested for a ShuffleHandle (for the ShuffleDependency)
                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/BypassMergeSortShuffleHandle/#demo","title":"Demo","text":"
                                                                                                                                                                                                                                                                                      val rdd = sc.parallelize(0 to 8).groupBy(_ % 3)\n\nassert(rdd.dependencies.length == 1)\n\nimport org.apache.spark.ShuffleDependency\nval shuffleDep = rdd.dependencies.head.asInstanceOf[ShuffleDependency[Int, Int, Int]]\n\nassert(shuffleDep.mapSideCombine == false, \"mapSideCombine should be disabled\")\nassert(shuffleDep.aggregator.isDefined)\n
                                                                                                                                                                                                                                                                                      // Use ':paste -raw' mode to paste the code\npackage org.apache.spark\nobject open {\n  import org.apache.spark.SparkContext\n  def bypassMergeThreshold(sc: SparkContext) = {\n    import org.apache.spark.internal.config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD\n    sc.getConf.get(SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)\n  }\n}\n
                                                                                                                                                                                                                                                                                      import org.apache.spark.open\nval bypassMergeThreshold = open.bypassMergeThreshold(sc)\n\nassert(shuffleDep.partitioner.numPartitions < bypassMergeThreshold)\n
                                                                                                                                                                                                                                                                                      import org.apache.spark.shuffle.sort.BypassMergeSortShuffleHandle\n// BypassMergeSortShuffleHandle is private[spark]\n// so the following won't work :(\n// assert(shuffleDep.shuffleHandle.isInstanceOf[BypassMergeSortShuffleHandle[Int, Int]])\nassert(shuffleDep.shuffleHandle.toString.contains(\"BypassMergeSortShuffleHandle\"))\n
                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/BypassMergeSortShuffleWriter/","title":"BypassMergeSortShuffleWriter","text":"

BypassMergeSortShuffleWriter<K, V> is a ShuffleWriter for ShuffleMapTasks to write records into one single shuffle block data file.

                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/BypassMergeSortShuffleWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                      BypassMergeSortShuffleWriter takes the following to be created:

                                                                                                                                                                                                                                                                                      • BlockManager
                                                                                                                                                                                                                                                                                      • BypassMergeSortShuffleHandle (of K keys and V values)
                                                                                                                                                                                                                                                                                      • Map ID
                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                      • ShuffleWriteMetricsReporter
                                                                                                                                                                                                                                                                                      • ShuffleExecutorComponents

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter is created when:

                                                                                                                                                                                                                                                                                        • SortShuffleManager is requested for a ShuffleWriter (for a BypassMergeSortShuffleHandle)
                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/BypassMergeSortShuffleWriter/#diskblockobjectwriters","title":"DiskBlockObjectWriters
                                                                                                                                                                                                                                                                                        DiskBlockObjectWriter[] partitionWriters\n

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter uses a DiskBlockObjectWriter per partition (based on the Partitioner).

BypassMergeSortShuffleWriter asserts that no partitionWriters have been created yet when it starts writing out records to a shuffle file.

                                                                                                                                                                                                                                                                                        While writing, BypassMergeSortShuffleWriter requests the BlockManager for as many DiskBlockObjectWriters as there are partitions (in the Partitioner).

                                                                                                                                                                                                                                                                                        While writing, BypassMergeSortShuffleWriter requests the Partitioner for a partition for records (using keys) and finds the per-partition DiskBlockObjectWriter that is requested to write out the partition records. After all records are written out to their shuffle files, the DiskBlockObjectWriters are requested to commitAndGet.
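For illustration, the tiny sketch below routes records to one slot per partition using the public HashPartitioner API; the writers array is only a stand-in for the per-partition DiskBlockObjectWriters and is not Spark's internal code.

import org.apache.spark.HashPartitioner

// One "writer" slot per partition (a stand-in for DiskBlockObjectWriters).
val partitioner = new HashPartitioner(4)
val writers = Array.fill(partitioner.numPartitions)(Seq.newBuilder[(String, Int)])

// Every record goes to the writer of its key's partition.
Seq("a" -> 1, "b" -> 2, "c" -> 3).foreach { case (k, v) =>
  writers(partitioner.getPartition(k)) += (k -> v)
}
writers.zipWithIndex.foreach { case (w, i) => println(s"partition $i -> ${w.result()}") }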

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter uses the partition writers while writing out partition data and removes references to them (nullify them) in the end.

In other words, after writing out partition data, the partitionWriters internal registry is null.

                                                                                                                                                                                                                                                                                        partitionWriters internal registry becomes null after BypassMergeSortShuffleWriter has finished:

                                                                                                                                                                                                                                                                                        • Writing out partition data
                                                                                                                                                                                                                                                                                        • Stopping
                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#indexshuffleblockresolver","title":"IndexShuffleBlockResolver

BypassMergeSortShuffleWriter is given an IndexShuffleBlockResolver when created.

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter uses the IndexShuffleBlockResolver for writing out records (to writeIndexFileAndCommit and getDataFile).

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#serializer","title":"Serializer

                                                                                                                                                                                                                                                                                        When created, BypassMergeSortShuffleWriter requests the ShuffleDependency (of the given BypassMergeSortShuffleHandle) for the Serializer.

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter creates a new instance of the Serializer for writing out records.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#configuration-properties","title":"Configuration Properties","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#sparkshufflefilebuffer","title":"spark.shuffle.file.buffer

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter uses spark.shuffle.file.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#sparkfiletransferto","title":"spark.file.transferTo

                                                                                                                                                                                                                                                                                        BypassMergeSortShuffleWriter uses spark.file.transferTo configuration property to control whether to use Java New I/O while writing to a partitioned file.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#writing-out-records-to-shuffle-file","title":"Writing Out Records to Shuffle File
                                                                                                                                                                                                                                                                                        void write(\n  Iterator<Product2<K, V>> records)\n

                                                                                                                                                                                                                                                                                        write is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                        write creates a new instance of the Serializer.

                                                                                                                                                                                                                                                                                        write initializes the partitionWriters and partitionWriterSegments internal registries (for DiskBlockObjectWriters and FileSegments for every partition, respectively).

write requests the BlockManager for the DiskBlockManager and, for every partition, requests the DiskBlockManager for a shuffle block ID and the corresponding file. write creates a DiskBlockObjectWriter for the shuffle block (using the BlockManager). write stores the references to the DiskBlockObjectWriters in the partitionWriters internal registry.

After all DiskBlockObjectWriters are created, write requests the ShuffleWriteMetrics to increment the shuffle write time metric.

                                                                                                                                                                                                                                                                                        For every record (a key-value pair), write requests the Partitioner for the partition ID for the key. The partition ID is then used as an index of the partition writer (among the DiskBlockObjectWriters) to write the current record out to a block file.

Once all records have been written out to their respective block files, write does the following for every DiskBlockObjectWriter:

                                                                                                                                                                                                                                                                                        1. Requests the DiskBlockObjectWriter to commit and return a corresponding FileSegment of the shuffle block

                                                                                                                                                                                                                                                                                        2. Saves the (reference to) FileSegments in the partitionWriterSegments internal registry

                                                                                                                                                                                                                                                                                        3. Requests the DiskBlockObjectWriter to close

                                                                                                                                                                                                                                                                                        Note

                                                                                                                                                                                                                                                                                        At this point, all the records are in shuffle block files on a local disk. The records are split across block files by key.

write requests the IndexShuffleBlockResolver for the shuffle file for the shuffle and the map IDs.

write creates a temporary file (based on the name of the shuffle file) and writes all the per-partition shuffle files to it. The size of every per-partition shuffle file is saved in the partitionLengths internal registry.

                                                                                                                                                                                                                                                                                        Note

At this point, all the per-partition shuffle block files have been merged into one single map shuffle data file.

                                                                                                                                                                                                                                                                                        write requests the IndexShuffleBlockResolver to write shuffle index and data files for the shuffle and the map IDs (with the partitionLengths and the temporary shuffle output file).

                                                                                                                                                                                                                                                                                        write returns a shuffle map output status (with the shuffle server ID and the partitionLengths).
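As a rough, standalone sketch of that overall flow (simplifying assumptions: plain java.io streams and Java serialization instead of DiskBlockObjectWriters, no metrics, no IndexShuffleBlockResolver; bypassWrite and its parameters are hypothetical names, not Spark's implementation):

import java.io.{File, FileOutputStream, ObjectOutputStream}
import java.nio.file.Files

// Illustration only: write every record straight to a per-partition temp file,
// then concatenate the temp files into one data file and return each partition's
// length (Spark hands such lengths to IndexShuffleBlockResolver for the index file).
def bypassWrite[K, V](
    records: Iterator[(K, V)],
    numPartitions: Int,
    partition: K => Int,
    output: File): Array[Long] = {
  val partFiles = Array.fill(numPartitions)(File.createTempFile("shuffle-part-", ".tmp"))
  val writers = partFiles.map(f => new ObjectOutputStream(new FileOutputStream(f)))
  records.foreach { case (k, v) => writers(partition(k)).writeObject((k, v)) }
  writers.foreach(_.close())

  val out = new FileOutputStream(output)
  try partFiles.map { f =>
    val length = Files.copy(f.toPath, out) // append this partition's bytes to the single file
    f.delete()
    length
  } finally out.close()
}

The real writer additionally records the partition lengths in an index file so that reducers can locate their block inside the single map shuffle data file.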

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#no-records","title":"No Records

When there are no records to write out, write initializes the partitionLengths internal array (of numPartitions size) with all elements being 0.

                                                                                                                                                                                                                                                                                        write requests the IndexShuffleBlockResolver to write shuffle index and data files, but the difference (compared to when there are records to write) is that the dataTmp argument is simply null.

                                                                                                                                                                                                                                                                                        write sets the internal mapStatus (with the address of BlockManager in use and partitionLengths).

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#requirements","title":"Requirements

                                                                                                                                                                                                                                                                                        write requires that there are no DiskBlockObjectWriters.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#writing-out-partitioned-data","title":"Writing Out Partitioned Data
                                                                                                                                                                                                                                                                                        long[] writePartitionedData(\n  ShuffleMapOutputWriter mapOutputWriter)\n

                                                                                                                                                                                                                                                                                        writePartitionedData makes sure that DiskBlockObjectWriters are available (partitionWriters != null).

                                                                                                                                                                                                                                                                                        For every partition, writePartitionedData takes the partition file (from the FileSegments). Only when the partition file exists, writePartitionedData requests the given ShuffleMapOutputWriter for a ShufflePartitionWriter and writes out the partitioned data. At the end, writePartitionedData deletes the file.

                                                                                                                                                                                                                                                                                        writePartitionedData requests the ShuffleWriteMetricsReporter to increment the write time.

                                                                                                                                                                                                                                                                                        In the end, writePartitionedData requests the ShuffleMapOutputWriter to commitAllPartitions and returns the size of each partition of the output map file.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#copying-raw-bytes-between-input-streams","title":"Copying Raw Bytes Between Input Streams
                                                                                                                                                                                                                                                                                        copyStream(\n  in: InputStream,\n  out: OutputStream,\n  closeStreams: Boolean = false,\n  transferToEnabled: Boolean = false): Long\n

copyStream branches off depending on the types of the in and out streams, i.e. whether they are both file streams and the transferToEnabled input flag is enabled.

If they are both file streams and transferToEnabled is enabled, copyStream gets their FileChannels, transfers bytes from the input file to the output file, and counts the number of bytes, possibly zero, that were actually transferred.

NOTE: copyStream uses Java's java.nio.channels.FileChannel to manage file channels.

If either of the in and out streams is not a file stream or the transferToEnabled flag is disabled (default), copyStream reads data from in, writes it to out, and counts the number of bytes written.

copyStream can optionally close the in and out streams (depending on the closeStreams input flag, disabled by default).

                                                                                                                                                                                                                                                                                        NOTE: Utils.copyStream is used when <> (among other places).

                                                                                                                                                                                                                                                                                        Tip

Visit the official web site of JSR 51: New I/O APIs for the Java Platform and read up on the java.nio package.
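A minimal standalone sketch of this copy logic (not the actual Utils.copyStream code; simplified error handling and no stream closing):

import java.io.{FileInputStream, FileOutputStream, InputStream, OutputStream}

// Use FileChannel.transferTo when both ends are file streams and transferTo is
// enabled; otherwise fall back to a plain buffered copy. Returns the byte count.
def copySketch(in: InputStream, out: OutputStream, transferToEnabled: Boolean = false): Long =
  (in, out) match {
    case (fin: FileInputStream, fout: FileOutputStream) if transferToEnabled =>
      val (inChannel, outChannel) = (fin.getChannel, fout.getChannel)
      val size = inChannel.size()
      var count = 0L
      while (count < size) {
        count += inChannel.transferTo(count, size - count, outChannel)
      }
      count
    case _ =>
      val buf = new Array[Byte](8192)
      var count = 0L
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        count += n
        n = in.read(buf)
      }
      count
  }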

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#stopping-shufflewriter","title":"Stopping ShuffleWriter
                                                                                                                                                                                                                                                                                        Option<MapStatus> stop(\n  boolean success)\n

                                                                                                                                                                                                                                                                                        stop...FIXME

stop is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#temporary-array-of-partition-lengths","title":"Temporary Array of Partition Lengths
                                                                                                                                                                                                                                                                                        long[] partitionLengths\n

                                                                                                                                                                                                                                                                                        Temporary array of partition lengths after records are written to a shuffle system.

Initialized every time BypassMergeSortShuffleWriter writes out records (before passing it on to the IndexShuffleBlockResolver). After the IndexShuffleBlockResolver finishes, it is used to initialize the mapStatus internal property.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter=ALL\n

                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#internal-properties","title":"Internal Properties","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#numpartitions","title":"numPartitions","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#partitionwritersegments","title":"partitionWriterSegments","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#mapstatus","title":"mapStatus

                                                                                                                                                                                                                                                                                        MapStatus that BypassMergeSortShuffleWriter returns when stopped

                                                                                                                                                                                                                                                                                        Initialized every time BypassMergeSortShuffleWriter writes out records.

Used when BypassMergeSortShuffleWriter stops (with success enabled) as a marker of whether any records were written, and returned if they were.

                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/DownloadFileManager/","title":"DownloadFileManager","text":"

                                                                                                                                                                                                                                                                                        DownloadFileManager is an abstraction of file managers that can createTempFile and registerTempFileToClean.

                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/DownloadFileManager/#contract","title":"Contract","text":""},{"location":"shuffle/DownloadFileManager/#createtempfile","title":"createTempFile
                                                                                                                                                                                                                                                                                        DownloadFile createTempFile(\n  TransportConf transportConf)\n

                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                        • DownloadCallback (of OneForOneBlockFetcher) is created
                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/DownloadFileManager/#registertempfiletoclean","title":"registerTempFileToClean
                                                                                                                                                                                                                                                                                        boolean registerTempFileToClean(\n  DownloadFile file)\n

                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                        • DownloadCallback (of OneForOneBlockFetcher) is requested to onComplete
                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/DownloadFileManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                        • RemoteBlockDownloadFileManager
                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator
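As a hedged, standalone analogue of this contract (not Spark's DownloadFileManager interface, which works with DownloadFile and TransportConf; the names below are hypothetical), a file manager of this shape hands out temp files and remembers which ones to delete later:

import java.io.File
import scala.collection.mutable

// Standalone analogue only: create temp files for downloaded blocks and register
// the ones that should be cleaned up once they are no longer needed.
class TempFileManagerSketch {
  private val filesToClean = mutable.Set.empty[File]

  def createTempFile(): File =
    File.createTempFile("shuffle-fetch-", ".tmp")

  def registerTempFileToClean(file: File): Boolean = synchronized {
    filesToClean.add(file)
  }

  def cleanUp(): Unit = synchronized {
    filesToClean.foreach(_.delete())
    filesToClean.clear()
  }
}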
                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ExecutorDiskUtils/","title":"ExecutorDiskUtils","text":""},{"location":"shuffle/ExternalAppendOnlyMap/","title":"ExternalAppendOnlyMap","text":"

                                                                                                                                                                                                                                                                                        ExternalAppendOnlyMap is a Spillable of SizeTrackers.

                                                                                                                                                                                                                                                                                        ExternalAppendOnlyMap[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ExternalAppendOnlyMap/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                        ExternalAppendOnlyMap takes the following to be created:

                                                                                                                                                                                                                                                                                        • [[createCombiner]] createCombiner function (V => C)
                                                                                                                                                                                                                                                                                        • [[mergeValue]] mergeValue function ((C, V) => C)
                                                                                                                                                                                                                                                                                        • [[mergeCombiners]] mergeCombiners function ((C, C) => C)
                                                                                                                                                                                                                                                                                        • [[serializer]] Optional serializer:Serializer.md[Serializer] (default: core:SparkEnv.md#serializer[system Serializer])
                                                                                                                                                                                                                                                                                        • [[blockManager]] Optional storage:BlockManager.md[BlockManager] (default: core:SparkEnv.md#blockManager[system BlockManager])
                                                                                                                                                                                                                                                                                        • [[context]] TaskContext
                                                                                                                                                                                                                                                                                        • [[serializerManager]] Optional serializer:SerializerManager.md[SerializerManager] (default: core:SparkEnv.md#serializerManager[system SerializerManager])

                                                                                                                                                                                                                                                                                        ExternalAppendOnlyMap is created when:

                                                                                                                                                                                                                                                                                        • Aggregator is requested to rdd:Aggregator.md#combineValuesByKey[combineValuesByKey] and rdd:Aggregator.md#combineCombinersByKey[combineCombinersByKey]

                                                                                                                                                                                                                                                                                        • CoGroupedRDD is requested to compute a partition

                                                                                                                                                                                                                                                                                        == [[currentMap]] SizeTrackingAppendOnlyMap

                                                                                                                                                                                                                                                                                        ExternalAppendOnlyMap manages a SizeTrackingAppendOnlyMap.

A SizeTrackingAppendOnlyMap is created immediately when ExternalAppendOnlyMap is created, and again every time the current map is spilled to disk (while inserting key-value pairs or when forced to spill).

The SizeTrackingAppendOnlyMap is dereferenced (nulled) for the memory to be garbage-collected when forcing disk spilling and when freeing up the current map.

The SizeTrackingAppendOnlyMap is used while inserting key-value pairs, creating the iterator of "combined" pairs, and spilling to disk.

                                                                                                                                                                                                                                                                                        == [[insertAll]] Inserting All Key-Value Pairs (from Iterator)

                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                        insertAll( entries: Iterator[Product2[K, V]]): Unit

[[insertAll-update-function]] insertAll creates an update function that uses the mergeValue function for an existing value or the createCombiner function for a new value (see the sketch after the steps below).

                                                                                                                                                                                                                                                                                        For every key-value pair (from the input iterator), insertAll does the following:

• Requests the SizeTrackingAppendOnlyMap for the estimated size and, if greater than the <<_peakMemoryUsedBytes, _peakMemoryUsedBytes>> metric, updates it.

• shuffle:Spillable.md#maybeSpill[Spills to disk if necessary] and, if spilled, creates a new SizeTrackingAppendOnlyMap

• Requests the SizeTrackingAppendOnlyMap to change the value for the current key (with the update function)

                                                                                                                                                                                                                                                                                        • shuffle:Spillable.md#addElementsRead[Increments the elements read counter]
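
The steps above condense into the following Scala sketch. It is an illustration only, not a verbatim copy of Spark's source: the names currentMap, mergeValue, createCombiner, maybeSpill and addElementsRead follow this page's descriptions.

[source, scala]
----
// Sketch only: assumes the constructor-provided createCombiner and mergeValue functions
// and the Spillable helpers (maybeSpill, addElementsRead) described on this page.
def insertAll(entries: Iterator[Product2[K, V]]): Unit = {
  if (currentMap == null) {
    throw new IllegalStateException(
      "Cannot insert new elements into a map after calling iterator")
  }
  var curEntry: Product2[K, V] = null
  // Existing value => mergeValue, new value => createCombiner
  val update: (Boolean, C) => C = (hadValue, oldValue) =>
    if (hadValue) mergeValue(oldValue, curEntry._2) else createCombiner(curEntry._2)

  while (entries.hasNext) {
    curEntry = entries.next()
    // Track the peak estimated size of the in-memory map
    val estimatedSize = currentMap.estimateSize()
    if (estimatedSize > _peakMemoryUsedBytes) {
      _peakMemoryUsedBytes = estimatedSize
    }
    // Spill to disk if necessary and, if spilled, start a fresh map
    if (maybeSpill(currentMap, estimatedSize)) {
      currentMap = new SizeTrackingAppendOnlyMap[K, C]
    }
    currentMap.changeValue(curEntry._1, update)
    addElementsRead()
  }
}
----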

=== [[insertAll-usage]] Usage

                                                                                                                                                                                                                                                                                          insertAll is used when:

                                                                                                                                                                                                                                                                                          • Aggregator is requested to rdd:Aggregator.md#combineValuesByKey[combineValuesByKey] and rdd:Aggregator.md#combineCombinersByKey[combineCombinersByKey]

                                                                                                                                                                                                                                                                                          • CoGroupedRDD is requested to compute a partition

• ExternalAppendOnlyMap is requested to insert a single key-value pair

                                                                                                                                                                                                                                                                                            === [[insertAll-requirements]] Requirements

insertAll throws an IllegalStateException when the currentMap internal registry is null:

[source,plaintext]
----
Cannot insert new elements into a map after calling iterator
----

== [[iterator]] Iterator of "Combined" Pairs

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala_1","title":"[source, scala]","text":""},{"location":"shuffle/ExternalAppendOnlyMap/#iterator-iteratork-c","title":"iterator: Iterator[(K, C)]","text":"

                                                                                                                                                                                                                                                                                            iterator...FIXME

                                                                                                                                                                                                                                                                                            iterator is used when...FIXME

                                                                                                                                                                                                                                                                                            == [[spill]] Spilling to Disk if Necessary

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                            spill( collection: SizeTracker): Unit

                                                                                                                                                                                                                                                                                            spill...FIXME

                                                                                                                                                                                                                                                                                            spill is used when...FIXME

                                                                                                                                                                                                                                                                                            == [[forceSpill]] Forcing Disk Spilling

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala_3","title":"[source, scala]","text":""},{"location":"shuffle/ExternalAppendOnlyMap/#forcespill-boolean","title":"forceSpill(): Boolean","text":"

                                                                                                                                                                                                                                                                                            forceSpill returns a flag to indicate whether spilling to disk has really happened (true) or not (false).

forceSpill branches off based on the current state of the ExternalAppendOnlyMap (and should arguably use a state-aware implementation).

When a SpillableIterator is in use, forceSpill requests it to spill and, if it did, dereferences (nulls) the SizeTrackingAppendOnlyMap. forceSpill returns whatever the spilling of the SpillableIterator returned.

When there is at least one element in the SizeTrackingAppendOnlyMap, forceSpill spills it to disk. forceSpill then creates a new SizeTrackingAppendOnlyMap and always returns true.

                                                                                                                                                                                                                                                                                            In other cases, forceSpill simply returns false.

                                                                                                                                                                                                                                                                                            forceSpill is part of the shuffle:Spillable.md[Spillable] abstraction.
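
Put together, the branching above could look like the following sketch (the readingIterator field for the SpillableIterator in use is an assumption based on this page; treat it as an illustration, not Spark's exact source):

[source, scala]
----
// Sketch only: readingIterator is assumed to hold the SpillableIterator in use (if any).
override protected[this] def forceSpill(): Boolean = {
  if (readingIterator != null) {
    // A SpillableIterator is in use: ask it to spill and, if it did, drop the in-memory map.
    val isSpilled = readingIterator.spill()
    if (isSpilled) {
      currentMap = null
    }
    isSpilled
  } else if (currentMap.size > 0) {
    // In-memory data only: spill the current map and start a fresh one.
    spill(currentMap)
    currentMap = new SizeTrackingAppendOnlyMap[K, C]
    true
  } else {
    false
  }
}
----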

                                                                                                                                                                                                                                                                                            == [[freeCurrentMap]] Freeing Up SizeTrackingAppendOnlyMap and Releasing Memory

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala_4","title":"[source, scala]","text":""},{"location":"shuffle/ExternalAppendOnlyMap/#freecurrentmap-unit","title":"freeCurrentMap(): Unit","text":"

freeCurrentMap dereferences (nulls) the SizeTrackingAppendOnlyMap (if there still was one), followed by shuffle:Spillable.md#releaseMemory[releasing all memory].

                                                                                                                                                                                                                                                                                            freeCurrentMap is used when SpillableIterator is requested to destroy itself.
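
In code, freeing the current map amounts to little more than the following sketch (releaseMemory is assumed to come from the Spillable contract):

[source, scala]
----
def freeCurrentMap(): Unit = {
  if (currentMap != null) {
    currentMap = null  // so the JVM can garbage-collect the map
    releaseMemory()    // Spillable.releaseMemory
  }
}
----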

                                                                                                                                                                                                                                                                                            == [[spillMemoryIteratorToDisk]] spillMemoryIteratorToDisk Method

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalAppendOnlyMap/#source-scala_5","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk( inMemoryIterator: Iterator[(K, C)]): DiskMapIterator

                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk...FIXME

                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk is used when...FIXME

                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalSorter/","title":"ExternalSorter","text":"

                                                                                                                                                                                                                                                                                            ExternalSorter is a Spillable of WritablePartitionedPairCollection of pairs (of K keys and C values).

                                                                                                                                                                                                                                                                                            ExternalSorter[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

                                                                                                                                                                                                                                                                                            ExternalSorter is used for the following:

                                                                                                                                                                                                                                                                                            • SortShuffleWriter to write records
                                                                                                                                                                                                                                                                                            • BlockStoreShuffleReader to read records (with a key ordering defined)
                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ExternalSorter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                            ExternalSorter takes the following to be created:

                                                                                                                                                                                                                                                                                            • TaskContext
                                                                                                                                                                                                                                                                                            • Optional Aggregator (default: undefined)
                                                                                                                                                                                                                                                                                            • Optional Partitioner (default: undefined)
                                                                                                                                                                                                                                                                                            • Optional Ordering (Scala) for keys (default: undefined)
• Serializer (default: system Serializer)

ExternalSorter is created when:

                                                                                                                                                                                                                                                                                              • BlockStoreShuffleReader is requested to read records (for a reduce task)
• SortShuffleWriter is requested to write records (as an ExternalSorter[K, V, C] or ExternalSorter[K, V, V] based on the Map-Side Partial Aggregation Flag)

== Inserting Records
[source, scala]
----
insertAll(
  records: Iterator[Product2[K, V]]): Unit
----

insertAll branches off based on whether the optional Aggregator was specified or not (when the ExternalSorter was created).

                                                                                                                                                                                                                                                                                              insertAll takes all records eagerly and materializes the given records iterator.

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#map-side-aggregator-specified","title":"Map-Side Aggregator Specified

                                                                                                                                                                                                                                                                                              With an Aggregator given, insertAll creates an update function based on the mergeValue and createCombiner functions of the Aggregator.

For every record, insertAll increments the internal read counter.

                                                                                                                                                                                                                                                                                              insertAll requests the PartitionedAppendOnlyMap to changeValue for the key (made up of the partition of the key of the current record and the key itself, i.e. (partition, key)) with the update function.

                                                                                                                                                                                                                                                                                              In the end, insertAll spills the in-memory collection to disk if needed with the usingMap flag enabled (to indicate that the PartitionedAppendOnlyMap was updated).
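
A condensed sketch of the aggregator path follows. The names map, getPartition, addElementsRead and maybeSpillCollection follow this page's descriptions; treat the snippet as an illustration under those assumptions, not the exact implementation.

[source, scala]
----
// Map-side aggregation: combine values per (partition, key) in the PartitionedAppendOnlyMap.
val mergeValue = aggregator.get.mergeValue
val createCombiner = aggregator.get.createCombiner
var kv: Product2[K, V] = null
val update = (hadValue: Boolean, oldValue: C) =>
  if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)

while (records.hasNext) {
  addElementsRead()                                   // internal read counter
  kv = records.next()
  map.changeValue((getPartition(kv._1), kv._1), update)
  maybeSpillCollection(usingMap = true)               // PartitionedAppendOnlyMap was updated
}
----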

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#no-map-side-aggregator-specified","title":"No Map-Side Aggregator Specified

                                                                                                                                                                                                                                                                                              With no Aggregator given, insertAll iterates over all the records and uses the PartitionedPairBuffer instead.

For every record, insertAll increments the internal read counter.

                                                                                                                                                                                                                                                                                              insertAll requests the PartitionedPairBuffer to insert with the partition of the key of the current record, the key itself and the value of the current record.

                                                                                                                                                                                                                                                                                              In the end, insertAll spills the in-memory collection to disk if needed with the usingMap flag disabled (since this time the PartitionedPairBuffer was updated, not the PartitionedAppendOnlyMap).
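
The buffer path is simpler, as the following sketch shows (same naming assumptions as the sketch above):

[source, scala]
----
// No map-side aggregation: append (partition, key, value) to the PartitionedPairBuffer as-is.
while (records.hasNext) {
  addElementsRead()                                   // internal read counter
  val kv = records.next()
  buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
  maybeSpillCollection(usingMap = false)              // PartitionedPairBuffer was updated
}
----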

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#spilling-in-memory-collection-to-disk","title":"Spilling In-Memory Collection to Disk
                                                                                                                                                                                                                                                                                              maybeSpillCollection(\n  usingMap: Boolean): Unit\n

maybeSpillCollection branches off based on the given usingMap flag that indicates which in-memory collection to use: the PartitionedAppendOnlyMap (true) or the PartitionedPairBuffer (false).

maybeSpillCollection requests the collection to estimate its size (in bytes), which is tracked as the peakMemoryUsedBytes metric (whenever the estimate is greater than what is currently recorded).

                                                                                                                                                                                                                                                                                              maybeSpillCollection spills the collection to disk if needed. If spilled, maybeSpillCollection creates a new collection (a new PartitionedAppendOnlyMap or a new PartitionedPairBuffer).
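
The whole routine fits in a few lines; the sketch below reflects the description above (maybeSpill is assumed to be the Spillable helper, and map and buffer the two in-memory collections):

[source, scala]
----
private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    estimatedSize = map.estimateSize()
    if (maybeSpill(map, estimatedSize)) {
      map = new PartitionedAppendOnlyMap[K, C]        // spilled: new in-memory map
    }
  } else {
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]        // spilled: new in-memory buffer
    }
  }
  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize              // track the peak size
  }
}
----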

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#usage","title":"Usage

                                                                                                                                                                                                                                                                                              insertAll is used when:

• SortShuffleWriter is requested to write records (as an ExternalSorter[K, V, C] or ExternalSorter[K, V, V] based on the Map-Side Partial Aggregation Flag)
• BlockStoreShuffleReader is requested to read records (with a key ordering defined)

== In-Memory Collections of Records

                                                                                                                                                                                                                                                                                              ExternalSorter uses PartitionedPairBuffers or PartitionedAppendOnlyMaps to store records in memory before spilling to disk.

ExternalSorter uses a PartitionedPairBuffer when created with no Aggregator, and a PartitionedAppendOnlyMap otherwise.

ExternalSorter inserts records into the collection in insertAll.

                                                                                                                                                                                                                                                                                              ExternalSorter spills the in-memory collection to disk if needed and, if so, creates a new collection.

                                                                                                                                                                                                                                                                                              ExternalSorter releases the collections (nulls them) when requested to forceSpill and stop. That is when the JVM garbage collector takes care of evicting them from memory completely.

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#peak-size-of-in-memory-collection","title":"Peak Size of In-Memory Collection

                                                                                                                                                                                                                                                                                              ExternalSorter tracks the peak size (in bytes) of the in-memory collection whenever requested to spill the in-memory collection to disk if needed.

                                                                                                                                                                                                                                                                                              The peak size is used when:

                                                                                                                                                                                                                                                                                              • BlockStoreShuffleReader is requested to read combined records for a reduce task (with an ordering defined)
                                                                                                                                                                                                                                                                                              • ExternalSorter is requested to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#spills","title":"Spills
                                                                                                                                                                                                                                                                                              spills: ArrayBuffer[SpilledFile]\n

                                                                                                                                                                                                                                                                                              ExternalSorter creates the spills internal buffer of SpilledFiles when created.

                                                                                                                                                                                                                                                                                              A new SpilledFile is added when ExternalSorter is requested to spill.

                                                                                                                                                                                                                                                                                              Note

An empty spills buffer indicates that there is only in-memory data.

                                                                                                                                                                                                                                                                                              SpilledFiles are deleted physically from disk and the spills buffer is cleared when ExternalSorter is requested to stop.

ExternalSorter uses the spills buffer when requested for a partitionedIterator.

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#number-of-spills","title":"Number of Spills
                                                                                                                                                                                                                                                                                              numSpills: Int\n

                                                                                                                                                                                                                                                                                              numSpills is the number of spill files this sorter has spilled.
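
Given the spills registry above, this boils down to the following sketch:

[source, scala]
----
def numSpills: Int = spills.size  // one SpilledFile per spill
----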

                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ExternalSorter/#spilledfile","title":"SpilledFile

SpilledFile is the metadata of a spilled file (see the sketch after the list below):

                                                                                                                                                                                                                                                                                              • File (Java)
                                                                                                                                                                                                                                                                                              • BlockId
                                                                                                                                                                                                                                                                                              • Serializer Batch Sizes (Array[Long])
• Elements per Partition (Array[Long])
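
As a Scala sketch, the metadata boils down to a small case class; the field names are assumptions based on the list above.

[source, scala]
----
// Sketch of the SpilledFile metadata (kept private to ExternalSorter).
case class SpilledFile(
  file: java.io.File,                  // on-disk file with the spilled, serialized records
  blockId: BlockId,                    // block the file was written as
  serializerBatchSizes: Array[Long],   // size (in bytes) of every serialized batch written
  elementsPerPartition: Array[Long])   // number of records written per partition
----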
== Spilling Data to Disk

[source, scala]
----
spill(
  collection: WritablePartitionedPairCollection[K, C]): Unit
----

                                                                                                                                                                                                                                                                                                spill is part of the Spillable abstraction.

                                                                                                                                                                                                                                                                                                spill requests the given WritablePartitionedPairCollection for a destructive WritablePartitionedIterator.

spill then spills the in-memory data to disk (spillMemoryIteratorToDisk) with the destructive WritablePartitionedIterator, which creates a SpilledFile.

                                                                                                                                                                                                                                                                                                In the end, spill adds the SpilledFile to the spills internal registry.
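
The three steps above map onto a short method; the following sketch assumes a comparator for the destructive, partition-ordered iterator and is an illustration rather than the exact implementation:

[source, scala]
----
override protected[this] def spill(
    collection: WritablePartitionedPairCollection[K, C]): Unit = {
  // Destructive, partition-ordered iterator over the in-memory data
  val inMemoryIterator = collection.destructiveSortedWritablePartitionedIterator(comparator)
  // Write the iterator out to disk as a SpilledFile...
  val spillFile = spillMemoryIteratorToDisk(inMemoryIterator)
  // ...and register it
  spills += spillFile
}
----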

                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#spillmemoryiteratortodisk","title":"spillMemoryIteratorToDisk
                                                                                                                                                                                                                                                                                                spillMemoryIteratorToDisk(\n  inMemoryIterator: WritablePartitionedIterator): SpilledFile\n

                                                                                                                                                                                                                                                                                                spillMemoryIteratorToDisk...FIXME

                                                                                                                                                                                                                                                                                                spillMemoryIteratorToDisk is used when:

                                                                                                                                                                                                                                                                                                • ExternalSorter is requested to spill
                                                                                                                                                                                                                                                                                                • SpillableIterator is requested to spill
                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#partitionediterator","title":"partitionedIterator
                                                                                                                                                                                                                                                                                                partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])]\n

                                                                                                                                                                                                                                                                                                partitionedIterator...FIXME

                                                                                                                                                                                                                                                                                                partitionedIterator is used when:

                                                                                                                                                                                                                                                                                                • ExternalSorter is requested for an iterator and to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#writepartitionedmapoutput","title":"writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                writePartitionedMapOutput(\n  shuffleId: Int,\n  mapId: Long,\n  mapOutputWriter: ShuffleMapOutputWriter): Unit\n

                                                                                                                                                                                                                                                                                                writePartitionedMapOutput...FIXME

                                                                                                                                                                                                                                                                                                writePartitionedMapOutput is used when:

                                                                                                                                                                                                                                                                                                • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#iterator","title":"Iterator
                                                                                                                                                                                                                                                                                                iterator: Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                                iterator turns the isShuffleSort flag off (false).

iterator requests the partitionedIterator and takes only the combined values (the second elements).
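
That behaviour is essentially the following sketch:

[source, scala]
----
def iterator: Iterator[Product2[K, C]] = {
  isShuffleSort = false
  // Keep only the combined values; the partition ids (first elements) are dropped.
  partitionedIterator.flatMap(pair => pair._2)
}
----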

                                                                                                                                                                                                                                                                                                iterator is used when:

                                                                                                                                                                                                                                                                                                • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#stopping-externalsorter","title":"Stopping ExternalSorter
                                                                                                                                                                                                                                                                                                stop(): Unit\n

                                                                                                                                                                                                                                                                                                stop...FIXME

                                                                                                                                                                                                                                                                                                stop is used when:

                                                                                                                                                                                                                                                                                                • BlockStoreShuffleReader is requested to read records (with ordering defined)
                                                                                                                                                                                                                                                                                                • SortShuffleWriter is requested to stop
                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/ExternalSorter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.util.collection.ExternalSorter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

[source,plaintext]
----
log4j.logger.org.apache.spark.util.collection.ExternalSorter=ALL
----

                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/FetchFailedException/","title":"FetchFailedException","text":"

                                                                                                                                                                                                                                                                                                FetchFailedException exception may be thrown when a task runs (and ShuffleBlockFetcherIterator could not fetch shuffle blocks).

                                                                                                                                                                                                                                                                                                When FetchFailedException is reported, TaskRunner catches it and notifies the ExecutorBackend (with TaskState.FAILED task state).

                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/FetchFailedException/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                FetchFailedException takes the following to be created:

                                                                                                                                                                                                                                                                                                • BlockManagerId
                                                                                                                                                                                                                                                                                                • Shuffle ID
                                                                                                                                                                                                                                                                                                • Map ID
                                                                                                                                                                                                                                                                                                • Map Index
                                                                                                                                                                                                                                                                                                • Reduce ID
                                                                                                                                                                                                                                                                                                • Error Message
                                                                                                                                                                                                                                                                                                • Error Cause
While being created, FetchFailedException requests the current TaskContext to setFetchFailed.

                                                                                                                                                                                                                                                                                                  FetchFailedException is created\u00a0when:

                                                                                                                                                                                                                                                                                                  • ShuffleBlockFetcherIterator is requested to throw a FetchFailedException (for a ShuffleBlockId or a ShuffleBlockBatchId)
                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/FetchFailedException/#error-cause","title":"Error Cause

                                                                                                                                                                                                                                                                                                  FetchFailedException can be given an error cause when created.

The root cause of a FetchFailedException is usually that the executor (hosting the BlockManager with the requested shuffle blocks) was lost and is no longer available, due to one of the following:

1. An OutOfMemoryError (aka OOM) or some other unhandled exception was thrown
2. The cluster manager that manages the workers with the executors of your Spark application (e.g. Kubernetes, Hadoop YARN) enforces the container memory limits and eventually kills the executor due to excessive memory usage

                                                                                                                                                                                                                                                                                                  A solution is usually to tune the memory of your Spark application.
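For example, a minimal sketch of the usual memory-related settings one might tune (the values are placeholders, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")          // JVM heap per executor
  .set("spark.executor.memoryOverhead", "2g")  // extra container memory the cluster manager accounts for
  .set("spark.memory.fraction", "0.6")         // fraction of the heap for execution and storage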

                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/FetchFailedException/#taskcontext","title":"TaskContext

TaskContext comes with setFetchFailed and fetchFailed to hold a FetchFailedException unmodified (regardless of what happens in user code).

                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/","title":"IndexShuffleBlockResolver","text":"

                                                                                                                                                                                                                                                                                                  IndexShuffleBlockResolver is a ShuffleBlockResolver that manages shuffle block data and uses shuffle index files for faster shuffle data access.

                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/IndexShuffleBlockResolver/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                  IndexShuffleBlockResolver takes the following to be created:

                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                  • BlockManager

                                                                                                                                                                                                                                                                                                    IndexShuffleBlockResolver is created\u00a0when:

                                                                                                                                                                                                                                                                                                    • SortShuffleManager is created
                                                                                                                                                                                                                                                                                                    • LocalDiskShuffleExecutorComponents is requested to initializeExecutor

                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/IndexShuffleBlockResolver/#getstoredshuffles","title":"getStoredShuffles
                                                                                                                                                                                                                                                                                                    getStoredShuffles(): Seq[ShuffleBlockInfo]\n

                                                                                                                                                                                                                                                                                                    getStoredShuffles\u00a0is part of the MigratableResolver abstraction.

                                                                                                                                                                                                                                                                                                    getStoredShuffles...FIXME

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#putshuffleblockasstream","title":"putShuffleBlockAsStream
                                                                                                                                                                                                                                                                                                    putShuffleBlockAsStream(\n  blockId: BlockId,\n  serializerManager: SerializerManager): StreamCallbackWithID\n

                                                                                                                                                                                                                                                                                                    putShuffleBlockAsStream\u00a0is part of the MigratableResolver abstraction.

                                                                                                                                                                                                                                                                                                    putShuffleBlockAsStream...FIXME

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#getmigrationblocks","title":"getMigrationBlocks
                                                                                                                                                                                                                                                                                                    getMigrationBlocks(\n  shuffleBlockInfo: ShuffleBlockInfo): List[(BlockId, ManagedBuffer)]\n

                                                                                                                                                                                                                                                                                                    getMigrationBlocks\u00a0is part of the MigratableResolver abstraction.

                                                                                                                                                                                                                                                                                                    getMigrationBlocks...FIXME

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#writing-shuffle-index-and-data-files","title":"Writing Shuffle Index and Data Files
                                                                                                                                                                                                                                                                                                    writeIndexFileAndCommit(\n  shuffleId: Int,\n  mapId: Long,\n  lengths: Array[Long],\n  dataTmp: File): Unit\n

                                                                                                                                                                                                                                                                                                    writeIndexFileAndCommit finds the index and data files for the input shuffleId and mapId.

writeIndexFileAndCommit creates a temporary file for the index file (in the same directory) and writes the offsets to it (the cumulative sums of the input lengths, starting from 0, with the final offset being the total size of the shuffle data file).

Note

The offsets are simply the running totals of the input lengths.
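For illustration, a minimal sketch of how such offsets can be derived from per-partition block lengths (a hedged example, not the actual writeIndexFileAndCommit code):

// lengths(i) is the size (in bytes) of the map output for reduce partition i
val lengths: Array[Long] = Array(10L, 0L, 25L, 5L)

// The index file stores lengths.length + 1 offsets: a leading 0 followed by the running totals
val offsets: Array[Long] = lengths.scanLeft(0L)(_ + _)
// offsets == Array(0, 10, 10, 35, 40); the last offset equals the size of the data file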

                                                                                                                                                                                                                                                                                                    writeIndexFileAndCommit...FIXME (Review me)

writeIndexFileAndCommit gets the shuffle data file for the input shuffleId and mapId.

writeIndexFileAndCommit then checks the consistency of the existing shuffle index and data files (aka consistency check).

If the consistency check succeeds, it means that another attempt for the same task has already written the map outputs successfully, so the input dataTmp and temporary index files are deleted (as no longer needed).

If the consistency check fails, any existing index and data files are deleted (if they exist) and the temporary index and data files become \"official\", i.e. renamed to their final names.

In case of any IO-related exception, writeIndexFileAndCommit throws an IOException with one of the following messages:

                                                                                                                                                                                                                                                                                                    fail to rename file [indexTmp] to [indexFile]\n

                                                                                                                                                                                                                                                                                                    or

                                                                                                                                                                                                                                                                                                    fail to rename file [dataTmp] to [dataFile]\n
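For illustration, a hedged sketch of the delete-then-rename commit step (a hypothetical helper, not the actual Spark code; the rename error message mirrors the one above):

import java.io.{File, IOException}

// Replace the target file with the temporary one; fail loudly if the rename does not succeed
def commit(tmp: File, target: File): Unit = {
  if (target.exists() && !target.delete()) {
    throw new IOException(s"fail to delete file $target")
  }
  if (tmp.exists() && !tmp.renameTo(target)) {
    throw new IOException(s"fail to rename file $tmp to $target")
  }
}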

                                                                                                                                                                                                                                                                                                    writeIndexFileAndCommit\u00a0is used when:

                                                                                                                                                                                                                                                                                                    • LocalDiskShuffleMapOutputWriter is requested to commitAllPartitions
                                                                                                                                                                                                                                                                                                    • LocalDiskSingleSpillMapOutputWriter is requested to transferMapSpillFile
                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#removing-shuffle-index-and-data-files","title":"Removing Shuffle Index and Data Files
                                                                                                                                                                                                                                                                                                    removeDataByMap(\n  shuffleId: Int,\n  mapId: Long): Unit\n

removeDataByMap finds and deletes the shuffle data file (for the input shuffleId and mapId) and then finds and deletes the shuffle index file.

                                                                                                                                                                                                                                                                                                    removeDataByMap\u00a0is used when:

                                                                                                                                                                                                                                                                                                    • SortShuffleManager is requested to unregister a shuffle (and remove a shuffle from a shuffle system)
                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-shuffle-block-index-file","title":"Creating Shuffle Block Index File
                                                                                                                                                                                                                                                                                                    getIndexFile(\n  shuffleId: Int,\n  mapId: Long,\n  dirs: Option[Array[String]] = None): File\n

                                                                                                                                                                                                                                                                                                    getIndexFile creates a ShuffleIndexBlockId.

With the dirs local directories defined, getIndexFile resolves the index file (by the ShuffleIndexBlockId name) in one of the local directories (honoring spark.diskStore.subDirectories).

Otherwise, with no local directories, getIndexFile requests the DiskBlockManager (of the BlockManager) to get the index file.

                                                                                                                                                                                                                                                                                                    getIndexFile\u00a0is used when:

                                                                                                                                                                                                                                                                                                    • IndexShuffleBlockResolver is requested to getBlockData, removeDataByMap, putShuffleBlockAsStream, getMigrationBlocks, writeIndexFileAndCommit
                                                                                                                                                                                                                                                                                                    • FallbackStorage is requested to copy
                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-shuffle-block-data-file","title":"Creating Shuffle Block Data File
                                                                                                                                                                                                                                                                                                    getDataFile(\n  shuffleId: Int,\n  mapId: Long): File // (1)\ngetDataFile(\n  shuffleId: Int,\n  mapId: Long,\n  dirs: Option[Array[String]]): File\n
                                                                                                                                                                                                                                                                                                    1. dirs is None (undefined)

                                                                                                                                                                                                                                                                                                    getDataFile creates a ShuffleDataBlockId.

With the dirs local directories defined, getDataFile resolves the data file (by the ShuffleDataBlockId name) in one of the local directories (honoring spark.diskStore.subDirectories).

                                                                                                                                                                                                                                                                                                    Otherwise, with no local directories, getDataFile requests the DiskBlockManager (of the BlockManager) to get the data file.
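For illustration, a self-contained sketch of hash-based file placement across local directories and sub-directories (mirroring the role of spark.diskStore.subDirectories); this is a hedged approximation of what DiskBlockManager does, not the actual Spark code:

import java.io.File

// Place a block file deterministically: hash the file name into one of the local
// directories and then into one of the sub-directories within it
def placeFile(localDirs: Array[String], subDirsPerLocalDir: Int, filename: String): File = {
  val hash = filename.hashCode & Integer.MAX_VALUE
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
  new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
}

// e.g. placeFile(Array("/tmp/spark-local"), 64, "shuffle_0_1_0.data")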

                                                                                                                                                                                                                                                                                                    getDataFile\u00a0is used when:

                                                                                                                                                                                                                                                                                                    • IndexShuffleBlockResolver is requested to getBlockData, removeDataByMap, putShuffleBlockAsStream, getMigrationBlocks, writeIndexFileAndCommit
                                                                                                                                                                                                                                                                                                    • LocalDiskShuffleMapOutputWriter is created
                                                                                                                                                                                                                                                                                                    • LocalDiskSingleSpillMapOutputWriter is requested to transferMapSpillFile
                                                                                                                                                                                                                                                                                                    • FallbackStorage is requested to copy
                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-managedbuffer-to-read-shuffle-block-data-file","title":"Creating ManagedBuffer to Read Shuffle Block Data File
                                                                                                                                                                                                                                                                                                    getBlockData(\n  blockId: BlockId,\n  dirs: Option[Array[String]]): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                    getBlockData\u00a0is part of the ShuffleBlockResolver abstraction.

                                                                                                                                                                                                                                                                                                    getBlockData...FIXME

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#checking-consistency-of-shuffle-index-and-data-files","title":"Checking Consistency of Shuffle Index and Data Files
                                                                                                                                                                                                                                                                                                    checkIndexAndDataFile(\n  index: File,\n  data: File,\n  blocks: Int): Array[Long]\n

                                                                                                                                                                                                                                                                                                    Danger

                                                                                                                                                                                                                                                                                                    Review Me

checkIndexAndDataFile first checks whether the size of the input index file is exactly (blocks + 1) multiplied by 8 (one Long-sized offset per block plus the leading 0 offset).

                                                                                                                                                                                                                                                                                                    checkIndexAndDataFile returns null when the numbers, and hence the shuffle index and data files, don't match.

                                                                                                                                                                                                                                                                                                    checkIndexAndDataFile reads the shuffle index file and converts the offsets into lengths of each block.

                                                                                                                                                                                                                                                                                                    checkIndexAndDataFile makes sure that the size of the input shuffle data file is exactly the sum of the block lengths.

                                                                                                                                                                                                                                                                                                    checkIndexAndDataFile returns the block lengths if the numbers match, and null otherwise.
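For illustration, a self-contained, hedged sketch of such a consistency check (not the actual IndexShuffleBlockResolver code):

import java.io.{DataInputStream, File, FileInputStream}

// Returns the block lengths when the index and data files agree, null otherwise
def checkIndexAndData(index: File, data: File, blocks: Int): Array[Long] = {
  if (index.length() != (blocks + 1) * 8L) return null
  val lengths = new Array[Long](blocks)
  val in = new DataInputStream(new FileInputStream(index))
  try {
    var offset = in.readLong()          // the first offset is expected to be 0
    if (offset != 0L) return null
    var i = 0
    while (i < blocks) {
      val next = in.readLong()
      lengths(i) = next - offset        // consecutive offsets turn into block lengths
      offset = next
      i += 1
    }
  } finally {
    in.close()
  }
  // the data file size must equal the sum of all block lengths
  if (data.length() == lengths.sum) lengths else null
}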

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#transportconf","title":"TransportConf

IndexShuffleBlockResolver creates a TransportConf (for the shuffle module) when created.

                                                                                                                                                                                                                                                                                                    transportConf\u00a0is used in getMigrationBlocks and getBlockData.

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.shuffle.IndexShuffleBlockResolver logger to see what happens inside.

                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.shuffle.IndexShuffleBlockResolver=ALL\n

                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/LocalDiskShuffleDataIO/","title":"LocalDiskShuffleDataIO","text":"

                                                                                                                                                                                                                                                                                                    LocalDiskShuffleDataIO is a ShuffleDataIO.

                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/LocalDiskShuffleDataIO/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                    ShuffleExecutorComponents executor()\n

                                                                                                                                                                                                                                                                                                    executor\u00a0is part of the ShuffleDataIO abstraction.

                                                                                                                                                                                                                                                                                                    executor creates a new LocalDiskShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/LocalDiskShuffleExecutorComponents/","title":"LocalDiskShuffleExecutorComponents","text":"

                                                                                                                                                                                                                                                                                                    LocalDiskShuffleExecutorComponents is a ShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/LocalDiskShuffleExecutorComponents/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                    LocalDiskShuffleExecutorComponents takes the following to be created:

                                                                                                                                                                                                                                                                                                    • SparkConf

                                                                                                                                                                                                                                                                                                      LocalDiskShuffleExecutorComponents is created\u00a0when:

                                                                                                                                                                                                                                                                                                      • LocalDiskShuffleDataIO is requested for a ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/LocalDiskShuffleMapOutputWriter/","title":"LocalDiskShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                      LocalDiskShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/LocalDiskSingleSpillMapOutputWriter/","title":"LocalDiskSingleSpillMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                      LocalDiskSingleSpillMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/MigratableResolver/","title":"MigratableResolver","text":"

                                                                                                                                                                                                                                                                                                      MigratableResolver is an abstraction of resolvers that allow Spark to migrate shuffle blocks.

                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/MigratableResolver/#contract","title":"Contract","text":""},{"location":"shuffle/MigratableResolver/#getmigrationblocks","title":"getMigrationBlocks
                                                                                                                                                                                                                                                                                                      getMigrationBlocks(\n  shuffleBlockInfo: ShuffleBlockInfo): List[(BlockId, ManagedBuffer)]\n

                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                      • ShuffleMigrationRunnable is requested to run
                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/MigratableResolver/#getstoredshuffles","title":"getStoredShuffles
                                                                                                                                                                                                                                                                                                      getStoredShuffles(): Seq[ShuffleBlockInfo]\n

                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                      • BlockManagerDecommissioner is requested to refreshOffloadingShuffleBlocks
                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/MigratableResolver/#putshuffleblockasstream","title":"putShuffleBlockAsStream
                                                                                                                                                                                                                                                                                                      putShuffleBlockAsStream(\n  blockId: BlockId,\n  serializerManager: SerializerManager): StreamCallbackWithID\n

                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                      • BlockManager is requested to putBlockDataAsStream
                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/MigratableResolver/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                      • IndexShuffleBlockResolver
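For reference, a hedged sketch of the MigratableResolver contract described above (simplified; the import paths are indicative and may differ across Spark versions):

import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.client.StreamCallbackWithID
import org.apache.spark.serializer.SerializerManager
import org.apache.spark.shuffle.ShuffleBlockInfo
import org.apache.spark.storage.BlockId

trait MigratableResolver {
  // Shuffle blocks currently stored by this resolver (used when offloading shuffle blocks)
  def getStoredShuffles(): Seq[ShuffleBlockInfo]

  // Accept a shuffle block pushed from another node as a stream
  def putShuffleBlockAsStream(
      blockId: BlockId,
      serializerManager: SerializerManager): StreamCallbackWithID

  // Blocks (and their buffers) to migrate for a given shuffle map output
  def getMigrationBlocks(
      shuffleBlockInfo: ShuffleBlockInfo): List[(BlockId, ManagedBuffer)]
}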
                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/SerializedShuffleHandle/","title":"SerializedShuffleHandle","text":"

SerializedShuffleHandle is a BaseShuffleHandle that SortShuffleManager uses when canUseSerializedShuffle holds (i.e. when requested to register a shuffle and a BypassMergeSortShuffleHandle could not be selected).

                                                                                                                                                                                                                                                                                                      SerializedShuffleHandle tells SortShuffleManager to use UnsafeShuffleWriter when requested for a ShuffleWriter.
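For illustration, a self-contained, hypothetical sketch of how a handle's type can drive writer selection (simplified stand-ins for Spark's ShuffleHandle and ShuffleWriter hierarchies, not the actual SortShuffleManager code):

sealed trait Handle
case object SerializedHandle      extends Handle
case object BypassMergeSortHandle extends Handle
case object BaseHandle            extends Handle

// The handle chosen at shuffle-registration time determines the writer used later
def writerFor(handle: Handle): String = handle match {
  case SerializedHandle      => "UnsafeShuffleWriter"
  case BypassMergeSortHandle => "BypassMergeSortShuffleWriter"
  case BaseHandle            => "SortShuffleWriter"
}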

                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/SerializedShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                      SerializedShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                                      • Shuffle ID
                                                                                                                                                                                                                                                                                                      • ShuffleDependency

                                                                                                                                                                                                                                                                                                        SerializedShuffleHandle is created when:

                                                                                                                                                                                                                                                                                                        • SortShuffleManager is requested for a ShuffleHandle (for the ShuffleDependency)
                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleBlockPusher/","title":"ShuffleBlockPusher","text":"

                                                                                                                                                                                                                                                                                                        ShuffleBlockPusher is...FIXME

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleBlockResolver/","title":"ShuffleBlockResolver","text":"


ShuffleBlockResolver is an abstraction of shuffle block resolvers that BlockManager uses to retrieve shuffle block data for a logical shuffle block identifier (i.e. map, reduce, and shuffle).

NOTE: Shuffle block data files are often referred to as map output files.

NOTE: IndexShuffleBlockResolver is the default and only known ShuffleBlockResolver in Apache Spark.

                                                                                                                                                                                                                                                                                                        [[contract]] .ShuffleBlockResolver Contract [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

getBlockData

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleBlockResolver/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                        getBlockData( blockId: ShuffleBlockId): ManagedBuffer

Retrieves the data (as a ManagedBuffer) for the given ShuffleBlockId (a tuple of shuffleId, mapId and reduceId).

Used when BlockManager is requested to retrieve block data from the local block manager (getLocalBytes and getBlockData)

stop

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleBlockResolver/#source-scala_1","title":"[source, scala]","text":""},{"location":"shuffle/ShuffleBlockResolver/#stop-unit","title":"stop(): Unit","text":"

                                                                                                                                                                                                                                                                                                        Stops the ShuffleBlockResolver

Used when SortShuffleManager is requested to stop

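To make the two-method contract above concrete, here is a minimal, self-contained sketch with simplified stand-in types (the ShuffleBlockId and ManagedBuffer below are toy case classes, not Spark's actual classes):

```scala
import java.nio.ByteBuffer
import scala.collection.concurrent.TrieMap

// Toy stand-ins for Spark's ShuffleBlockId and ManagedBuffer (illustration only).
final case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int)
final case class ManagedBuffer(data: ByteBuffer)

// The two-method contract: resolve block data for a logical shuffle block id, and release resources.
trait ShuffleBlockResolver {
  def getBlockData(blockId: ShuffleBlockId): ManagedBuffer
  def stop(): Unit
}

// A toy resolver that keeps shuffle blocks in memory; real resolvers (IndexShuffleBlockResolver)
// serve the data from index and data files on disk instead.
class InMemoryShuffleBlockResolver extends ShuffleBlockResolver {
  private val blocks = TrieMap.empty[ShuffleBlockId, Array[Byte]]

  def putBlockData(blockId: ShuffleBlockId, bytes: Array[Byte]): Unit = blocks.update(blockId, bytes)

  override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer =
    ManagedBuffer(ByteBuffer.wrap(blocks.getOrElse(blockId, Array.emptyByteArray)))

  override def stop(): Unit = blocks.clear()
}
```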

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleDataIO/","title":"ShuffleDataIO","text":"

ShuffleDataIO is an abstraction of pluggable shuffle block store plugins for storing shuffle blocks in arbitrary storage backends.

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleDataIO/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleDataIO/#shuffledrivercomponents","title":"ShuffleDriverComponents
                                                                                                                                                                                                                                                                                                        ShuffleDriverComponents driver()\n

                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                        • SparkContext is created
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleDataIO/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                        ShuffleExecutorComponents executor()\n

                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                        • SortShuffleManager utility is used to load the ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleDataIO/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                        • LocalDiskShuffleDataIO
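Besides the built-in LocalDiskShuffleDataIO, a custom plugin only has to provide the driver() and executor() methods above. A hypothetical skeleton is sketched below; the traits are simplified stand-ins, not the actual Java interfaces in org.apache.spark.shuffle.api:

```scala
// Simplified stand-in traits mirroring the contract above; the real interfaces are
// Java interfaces in org.apache.spark.shuffle.api and carry more methods.
trait ShuffleDriverComponents
trait ShuffleExecutorComponents

trait ShuffleDataIO {
  def driver(): ShuffleDriverComponents       // used when SparkContext is created
  def executor(): ShuffleExecutorComponents   // used when SortShuffleManager loads executor components
}

// A do-nothing plugin skeleton; a real plugin would wire these components to an external
// shuffle storage backend (the class name is made up for this example).
class NoopShuffleDataIO extends ShuffleDataIO {
  override def driver(): ShuffleDriverComponents = new ShuffleDriverComponents {}
  override def executor(): ShuffleExecutorComponents = new ShuffleExecutorComponents {}
}
```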
                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleDataIOUtils/","title":"ShuffleDataIOUtils","text":""},{"location":"shuffle/ShuffleDataIOUtils/#loading-shuffledataio","title":"Loading ShuffleDataIO
                                                                                                                                                                                                                                                                                                        loadShuffleDataIO(\n  conf: SparkConf): ShuffleDataIO\n

                                                                                                                                                                                                                                                                                                        loadShuffleDataIO uses the spark.shuffle.sort.io.plugin.class configuration property to load the ShuffleDataIO.

                                                                                                                                                                                                                                                                                                        loadShuffleDataIO\u00a0is used when:

                                                                                                                                                                                                                                                                                                        • SparkContext is created
                                                                                                                                                                                                                                                                                                        • SortShuffleManager utility is used to loadShuffleExecutorComponents
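For example, assuming a custom plugin class com.example.MyShuffleDataIO (a hypothetical name), it could be selected via the configuration property as follows:

```scala
import org.apache.spark.SparkConf

// Select a custom ShuffleDataIO plugin; com.example.MyShuffleDataIO is a hypothetical class name.
val conf = new SparkConf()
  .setAppName("shuffle-plugin-demo")
  .set("spark.shuffle.sort.io.plugin.class", "com.example.MyShuffleDataIO")
```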
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleDriverComponents/","title":"ShuffleDriverComponents","text":"

ShuffleDriverComponents is the driver-side component of a ShuffleDataIO plugin (returned by ShuffleDataIO.driver when SparkContext is created).

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleExecutorComponents/","title":"ShuffleExecutorComponents","text":"

                                                                                                                                                                                                                                                                                                        ShuffleExecutorComponents is an abstraction of executor shuffle builders.

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleExecutorComponents/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleExecutorComponents/#createmapoutputwriter","title":"createMapOutputWriter
                                                                                                                                                                                                                                                                                                        ShuffleMapOutputWriter createMapOutputWriter(\n  int shuffleId,\n  long mapTaskId,\n  int numPartitions) throws IOException\n

                                                                                                                                                                                                                                                                                                        Creates a ShuffleMapOutputWriter

                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                        • UnsafeShuffleWriter is requested to mergeSpills and mergeSpillsUsingStandardWriter
                                                                                                                                                                                                                                                                                                        • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#createsinglefilemapoutputwriter","title":"createSingleFileMapOutputWriter
                                                                                                                                                                                                                                                                                                        Optional<SingleSpillShuffleMapOutputWriter> createSingleFileMapOutputWriter(\n  int shuffleId,\n  long mapId) throws IOException\n

                                                                                                                                                                                                                                                                                                        Creates a SingleSpillShuffleMapOutputWriter

                                                                                                                                                                                                                                                                                                        Default: empty

                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                        • UnsafeShuffleWriter is requested to mergeSpills
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#initializeexecutor","title":"initializeExecutor
                                                                                                                                                                                                                                                                                                        void initializeExecutor(\n  String appId,\n  String execId,\n  Map<String, String> extraConfigs);\n

                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                        • SortShuffleManager utility is used to loadShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                        • LocalDiskShuffleExecutorComponents
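A sketch of the executor-side contract above, using simplified stand-in types rather than the actual Java interfaces in org.apache.spark.shuffle.api:

```scala
import java.util.Optional

// Simplified stand-ins for illustration only.
trait ShuffleMapOutputWriter
trait SingleSpillShuffleMapOutputWriter

trait ShuffleExecutorComponents {
  // Called once per executor before any writers are created.
  def initializeExecutor(appId: String, execId: String, extraConfigs: java.util.Map[String, String]): Unit

  // One writer per map task, covering all of its shuffle partitions.
  def createMapOutputWriter(shuffleId: Int, mapTaskId: Long, numPartitions: Int): ShuffleMapOutputWriter

  // Optional fast path when a map task produced a single spill file; empty by default.
  def createSingleFileMapOutputWriter(shuffleId: Int, mapId: Long): Optional[SingleSpillShuffleMapOutputWriter] =
    Optional.empty[SingleSpillShuffleMapOutputWriter]()
}
```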
                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleExternalSorter/","title":"ShuffleExternalSorter","text":"

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter is a specialized cache-efficient sorter that sorts arrays of compressed record pointers and partition ids.

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter uses only 8 bytes of space per record in the sorting array to fit more of the array into cache.

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter is created and used by UnsafeShuffleWriter only.
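The 8-bytes-per-record figure comes from packing a record's partition id together with its in-memory address into a single long. A simplified, self-contained illustration of that idea follows; the bit split is illustrative, and Spark's PackedRecordPointer may use a different exact layout:

```scala
// Illustrative packing of (partitionId, recordAddress) into one 64-bit long:
// high 24 bits hold the partition id, low 40 bits hold an encoded record address.
object PackedPointer {
  private val PartitionBits = 24
  private val AddressBits   = 40
  private val AddressMask   = (1L << AddressBits) - 1

  def pack(partitionId: Int, address: Long): Long = {
    require(partitionId >= 0 && partitionId < (1 << PartitionBits), "partition id out of range")
    (partitionId.toLong << AddressBits) | (address & AddressMask)
  }

  def partitionId(packed: Long): Int = (packed >>> AddressBits).toInt
  def address(packed: Long): Long    = packed & AddressMask
}
```

Sorting an array of such longs orders the records by partition id without touching the serialized record data, which is what makes the sorter cache-efficient.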

                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleExternalSorter/#memoryconsumer","title":"MemoryConsumer

ShuffleExternalSorter is a MemoryConsumer with a page size of 128 MB (or the TaskMemoryManager's page size, whichever is smaller).

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter can spill to disk to free up execution memory.

                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExternalSorter/#configuration-properties","title":"Configuration Properties","text":""},{"location":"shuffle/ShuffleExternalSorter/#sparkshufflefilebuffer","title":"spark.shuffle.file.buffer

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter uses spark.shuffle.file.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExternalSorter/#sparkshufflespillnumelementsforcespillthreshold","title":"spark.shuffle.spill.numElementsForceSpillThreshold

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter uses spark.shuffle.spill.numElementsForceSpillThreshold configuration property for...FIXME

                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"shuffle/ShuffleExternalSorter/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                        ShuffleExternalSorter takes the following to be created:

                                                                                                                                                                                                                                                                                                        • TaskMemoryManager
                                                                                                                                                                                                                                                                                                        • BlockManager
                                                                                                                                                                                                                                                                                                        • TaskContext
                                                                                                                                                                                                                                                                                                        • Initial Size
                                                                                                                                                                                                                                                                                                        • Number of Partitions
                                                                                                                                                                                                                                                                                                        • SparkConf
                                                                                                                                                                                                                                                                                                        • ShuffleWriteMetricsReporter

                                                                                                                                                                                                                                                                                                          ShuffleExternalSorter is created when:

                                                                                                                                                                                                                                                                                                          • UnsafeShuffleWriter is requested to open a ShuffleExternalSorter
                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#shuffleinmemorysorter","title":"ShuffleInMemorySorter

                                                                                                                                                                                                                                                                                                          ShuffleExternalSorter manages a ShuffleInMemorySorter:

• ShuffleInMemorySorter is created immediately when ShuffleExternalSorter is created

• ShuffleInMemorySorter is requested to free up memory and is dereferenced (nulled) when ShuffleExternalSorter is requested to cleanupResources or closeAndGetSpills

                                                                                                                                                                                                                                                                                                          ShuffleExternalSorter uses the ShuffleInMemorySorter for the following:

                                                                                                                                                                                                                                                                                                          • writeSortedFile
                                                                                                                                                                                                                                                                                                          • spill
                                                                                                                                                                                                                                                                                                          • getMemoryUsage
                                                                                                                                                                                                                                                                                                          • growPointerArrayIfNecessary
                                                                                                                                                                                                                                                                                                          • insertRecord
                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#spilling-to-disk","title":"Spilling To Disk
                                                                                                                                                                                                                                                                                                          long spill(\n  long size,\n  MemoryConsumer trigger)\n

                                                                                                                                                                                                                                                                                                          spill is part of the MemoryConsumer abstraction.

                                                                                                                                                                                                                                                                                                          spill returns the memory bytes spilled (spill size).

                                                                                                                                                                                                                                                                                                          spill prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                          Thread [threadId] spilling sort data of [memoryUsage] to disk ([spillsSize] [time|times] so far)\n

spill writes the in-memory sorted data out to a spill file (writeSortedFile with the isLastFile flag disabled).

                                                                                                                                                                                                                                                                                                          spill frees up execution memory (and records the memory bytes spilled as spillSize).

                                                                                                                                                                                                                                                                                                          spill requests the ShuffleInMemorySorter to reset.

                                                                                                                                                                                                                                                                                                          In the end, spill requests the TaskContext for TaskMetrics to increase the memory bytes spilled.
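A schematic, self-contained sketch of the spill flow just described; every class and method below is a simplified stand-in, not Spark's actual ShuffleExternalSorter:

```scala
// Simplified stand-ins for illustration only.
class ToyTaskMetrics {
  private var memoryBytesSpilled = 0L
  def incMemoryBytesSpilled(bytes: Long): Unit = memoryBytesSpilled += bytes
}

class ToyShuffleSorter(taskMetrics: ToyTaskMetrics) {
  private var numSpills = 0
  private var inMemoryBytes = 0L

  def insert(recordSize: Long): Unit = inMemoryBytes += recordSize

  // Mirrors the spill steps above: log, write the sorted spill file, free memory,
  // reset the in-memory sorter, record the metric, and return the spill size.
  def spill(): Long = {
    println(s"Thread ${Thread.currentThread().getId} spilling sort data of " +
      s"$inMemoryBytes bytes to disk ($numSpills times so far)")
    writeSortedFile(isLastFile = false)
    val spillSize = freeMemory()
    resetInMemorySorter()
    taskMetrics.incMemoryBytesSpilled(spillSize)
    spillSize
  }

  private def writeSortedFile(isLastFile: Boolean): Unit = numSpills += 1
  private def freeMemory(): Long = { val freed = inMemoryBytes; inMemoryBytes = 0L; freed }
  private def resetInMemorySorter(): Unit = ()
}
```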

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#closeandgetspills","title":"closeAndGetSpills
                                                                                                                                                                                                                                                                                                          SpillInfo[] closeAndGetSpills()\n

                                                                                                                                                                                                                                                                                                          closeAndGetSpills...FIXME

                                                                                                                                                                                                                                                                                                          closeAndGetSpills is used when UnsafeShuffleWriter is requested to closeAndWriteOutput.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#getmemoryusage","title":"getMemoryUsage
                                                                                                                                                                                                                                                                                                          long getMemoryUsage()\n

                                                                                                                                                                                                                                                                                                          getMemoryUsage...FIXME

                                                                                                                                                                                                                                                                                                          getMemoryUsage is used when ShuffleExternalSorter is created and requested to spill and updatePeakMemoryUsed.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#updatepeakmemoryused","title":"updatePeakMemoryUsed
                                                                                                                                                                                                                                                                                                          void updatePeakMemoryUsed()\n

                                                                                                                                                                                                                                                                                                          updatePeakMemoryUsed...FIXME

                                                                                                                                                                                                                                                                                                          updatePeakMemoryUsed is used when ShuffleExternalSorter is requested to getPeakMemoryUsedBytes and freeMemory.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#writesortedfile","title":"writeSortedFile
                                                                                                                                                                                                                                                                                                          void writeSortedFile(\n  boolean isLastFile)\n

                                                                                                                                                                                                                                                                                                          writeSortedFile...FIXME

                                                                                                                                                                                                                                                                                                          writeSortedFile is used when:

                                                                                                                                                                                                                                                                                                          • ShuffleExternalSorter is requested to spill and closeAndGetSpills
                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#cleanupresources","title":"cleanupResources
                                                                                                                                                                                                                                                                                                          void cleanupResources()\n

                                                                                                                                                                                                                                                                                                          cleanupResources...FIXME

                                                                                                                                                                                                                                                                                                          cleanupResources is used when UnsafeShuffleWriter is requested to write records and stop.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#inserting-serialized-record-into-shuffleinmemorysorter","title":"Inserting Serialized Record Into ShuffleInMemorySorter
                                                                                                                                                                                                                                                                                                          void insertRecord(\n  Object recordBase,\n  long recordOffset,\n  int length,\n  int partitionId)\n

                                                                                                                                                                                                                                                                                                          insertRecord...FIXME

insertRecord grows the pointer array if necessary (growPointerArrayIfNecessary).

                                                                                                                                                                                                                                                                                                          insertRecord...FIXME

insertRecord acquires a new memory page if necessary (acquireNewPageIfNecessary).

                                                                                                                                                                                                                                                                                                          insertRecord...FIXME

insertRecord is used when UnsafeShuffleWriter is requested to insertRecordIntoSorter.
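The overall shape of insertRecord, sketched with simplified stand-ins (a plain byte array plays the role of a TaskMemoryManager page; this is not Spark's actual implementation):

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the insertRecord flow above; an ArrayBuffer plays the role of
// the growable pointer array, and a plain byte array plays the role of a memory page.
class ToyRecordInserter(pageSize: Int) {
  private val packedPointers = ArrayBuffer.empty[Long]   // 8 bytes per inserted record
  private var page = new Array[Byte](pageSize)           // current "memory page"
  private var pageCursor = 0

  def insertRecord(record: Array[Byte], partitionId: Int): Unit = {
    // growPointerArrayIfNecessary: the ArrayBuffer grows on demand here
    acquireNewPageIfNecessary(record.length)
    // copy the serialized record into the current page
    System.arraycopy(record, 0, page, pageCursor, record.length)
    val address = pageCursor.toLong                      // real code encodes page number + offset
    packedPointers += ((partitionId.toLong << 40) | address)
    pageCursor += record.length
  }

  private def acquireNewPageIfNecessary(required: Int): Unit =
    if (pageCursor + required > page.length) {
      page = new Array[Byte](math.max(pageSize, required))
      pageCursor = 0
    }
}
```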

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#growpointerarrayifnecessary","title":"growPointerArrayIfNecessary
                                                                                                                                                                                                                                                                                                          void growPointerArrayIfNecessary()\n

                                                                                                                                                                                                                                                                                                          growPointerArrayIfNecessary...FIXME

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#acquirenewpageifnecessary","title":"acquireNewPageIfNecessary
                                                                                                                                                                                                                                                                                                          void acquireNewPageIfNecessary(\n  int required)\n

                                                                                                                                                                                                                                                                                                          acquireNewPageIfNecessary...FIXME

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#freememory","title":"freeMemory
                                                                                                                                                                                                                                                                                                          long freeMemory()\n

                                                                                                                                                                                                                                                                                                          freeMemory...FIXME

                                                                                                                                                                                                                                                                                                          freeMemory is used when ShuffleExternalSorter is requested to spill, cleanupResources, and closeAndGetSpills.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#peak-memory-used","title":"Peak Memory Used
                                                                                                                                                                                                                                                                                                          long getPeakMemoryUsedBytes()\n

                                                                                                                                                                                                                                                                                                          getPeakMemoryUsedBytes...FIXME

                                                                                                                                                                                                                                                                                                          getPeakMemoryUsedBytes is used when UnsafeShuffleWriter is requested to updatePeakMemoryUsed.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleExternalSorter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.shuffle.sort.ShuffleExternalSorter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                          log4j.logger.org.apache.spark.shuffle.sort.ShuffleExternalSorter=ALL\n

                                                                                                                                                                                                                                                                                                          Refer to Logging.

                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleHandle/","title":"ShuffleHandle","text":"

                                                                                                                                                                                                                                                                                                          ShuffleHandle is an abstraction of shuffle handles for ShuffleManager to pass information about shuffles to tasks.

                                                                                                                                                                                                                                                                                                          ShuffleHandle is Serializable (Java).

                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleHandle/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                          • BaseShuffleHandle
                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                          ShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                                          • Shuffle ID

                                                                                                                                                                                                                                                                                                            Abstract Class

ShuffleHandle is an abstract class and cannot be created directly. It is created indirectly through the concrete ShuffleHandles (e.g. BaseShuffleHandle).
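A minimal sketch of that shape (stand-in classes for illustration only, not Spark's actual ShuffleHandle hierarchy):

```scala
// An abstract, serializable handle identified by the shuffle id.
abstract class ShuffleHandle(val shuffleId: Int) extends Serializable

// A concrete handle carries whatever a ShuffleManager wants tasks to know about the shuffle
// (the extra field below is made up for the example).
class ToyShuffleHandle(shuffleId: Int, val numPartitions: Int) extends ShuffleHandle(shuffleId)
```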

                                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/ShuffleInMemorySorter/","title":"ShuffleInMemorySorter","text":"

ShuffleInMemorySorter is used by ShuffleExternalSorter to sort compressed record pointers and partition ids using either the radix or Tim sort algorithm.

Creating Instance

                                                                                                                                                                                                                                                                                                            ShuffleInMemorySorter takes the following to be created:

                                                                                                                                                                                                                                                                                                            • [[consumer]] memory:MemoryConsumer.md[MemoryConsumer]
                                                                                                                                                                                                                                                                                                            • [[initialSize]] Initial size
• [[useRadixSort]] useRadixSort flag (to indicate whether to use Radix Sort or Tim Sort)

ShuffleInMemorySorter requests the given <<consumer, MemoryConsumer>> to memory:MemoryConsumer.md#allocateArray[allocate an array] of the given <<initialSize, initial size>> for the <<array, internal LongArray of record pointers and partition IDs>>.
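A hedged sketch of that allocation step, assuming `consumer` is the given <<consumer, MemoryConsumer>> and `initialSize` the requested number of entries (the helper name is hypothetical):

[source, scala]
----
import org.apache.spark.memory.MemoryConsumer
import org.apache.spark.unsafe.array.LongArray

// Allocate the backing array for record pointers and partition IDs.
def allocateSortArray(consumer: MemoryConsumer, initialSize: Int): LongArray =
  consumer.allocateArray(initialSize.toLong)
----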

                                                                                                                                                                                                                                                                                                              ShuffleInMemorySorter is created for a shuffle:ShuffleExternalSorter.md#inMemSorter[ShuffleExternalSorter].

== [[getSortedIterator]] Iterator of Sorted Records

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleInMemorySorter/#source-java","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#shufflesorteriterator-getsortediterator","title":"ShuffleSorterIterator getSortedIterator()","text":"

                                                                                                                                                                                                                                                                                                              getSortedIterator...FIXME

                                                                                                                                                                                                                                                                                                              getSortedIterator is used when ShuffleExternalSorter is requested to shuffle:ShuffleExternalSorter.md#writeSortedFile[writeSortedFile].

                                                                                                                                                                                                                                                                                                              == [[reset]] Resetting

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_1","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#void-reset","title":"void reset()","text":"

                                                                                                                                                                                                                                                                                                              reset...FIXME

                                                                                                                                                                                                                                                                                                              reset is used when...FIXME

                                                                                                                                                                                                                                                                                                              == [[numRecords]] numRecords Method

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_2","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#int-numrecords","title":"int numRecords()","text":"

                                                                                                                                                                                                                                                                                                              numRecords...FIXME

                                                                                                                                                                                                                                                                                                              numRecords is used when...FIXME

                                                                                                                                                                                                                                                                                                              == [[getUsableCapacity]] Calculating Usable Capacity

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_3","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#int-getusablecapacity","title":"int getUsableCapacity()","text":"

getUsableCapacity calculates the usable capacity as a half (with <<useRadixSort, Radix Sort>>) or two-thirds (with Tim Sort) of the size of the <<array, LongArray>>.

                                                                                                                                                                                                                                                                                                              getUsableCapacity is used when...FIXME
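A hedged sketch of that calculation: Radix Sort needs half of the backing array as scratch space, while Tim Sort needs about a third, so only the remainder is usable for records. `arraySize` stands for the number of entries in the <<array, LongArray>> and the helper name is hypothetical:

[source, scala]
----
// Usable entries of the pointer array, depending on the sort algorithm.
def usableCapacity(arraySize: Long, useRadixSort: Boolean): Int =
  (arraySize / (if (useRadixSort) 2.0 else 1.5)).toInt
----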

                                                                                                                                                                                                                                                                                                              == [[logging]] Logging

                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.shuffle.sort.ShuffleExternalSorter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleInMemorySorter/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#log4jloggerorgapachesparkshufflesortshuffleexternalsorterall","title":"log4j.logger.org.apache.spark.shuffle.sort.ShuffleExternalSorter=ALL","text":"

                                                                                                                                                                                                                                                                                                              Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                              == [[internal-properties]] Internal Properties

                                                                                                                                                                                                                                                                                                              === [[array]] Unsafe LongArray of Record Pointers and Partition IDs

ShuffleInMemorySorter uses a LongArray to store the record pointers and partition IDs of the records to sort.

                                                                                                                                                                                                                                                                                                              === [[usableCapacity]] Usable Capacity

                                                                                                                                                                                                                                                                                                              ShuffleInMemorySorter...FIXME

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleManager/","title":"ShuffleManager","text":"

                                                                                                                                                                                                                                                                                                              ShuffleManager is an abstraction of shuffle managers that manage shuffle data.

                                                                                                                                                                                                                                                                                                              ShuffleManager is specified using spark.shuffle.manager configuration property.

                                                                                                                                                                                                                                                                                                              ShuffleManager is used to create a BlockManager.
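For example, the sort-based manager can be requested explicitly; the short names sort and tungsten-sort both resolve to SortShuffleManager:

```scala
import org.apache.spark.SparkConf

// Explicitly select the (default) sort-based shuffle manager.
val conf = new SparkConf().set("spark.shuffle.manager", "sort")
```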

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleManager/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleManager/#getting-shufflereader-for-shufflehandle","title":"Getting ShuffleReader for ShuffleHandle
                                                                                                                                                                                                                                                                                                              getReader[K, C](\n  handle: ShuffleHandle,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                              ShuffleReader to read shuffle data (for the given ShuffleHandle)

Used when the following RDDs are requested to compute a partition (see the sketch after this list):

• CoGroupedRDD
• ShuffledRDD
• ShuffledRowRDD (Spark SQL)
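A hedged sketch (not the exact RDD code) of how such a compute method could obtain and use a ShuffleReader for a single reduce partition. `dep`, `split` and `context` stand for the ShuffleDependency, Partition and TaskContext of the running task and are assumed to be in scope:

```scala
// Runs inside a reduce task; `dep`, `split`, `context` are assumed inputs.
val reader = SparkEnv.get.shuffleManager.getReader(
  dep.shuffleHandle,
  split.index,       // startPartition
  split.index + 1,   // endPartition (exclusive)
  context,
  context.taskMetrics().createTempShuffleReadMetrics())
val records = reader.read()  // Iterator[Product2[K, C]] of combined records
```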
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#getreaderforrange","title":"getReaderForRange
                                                                                                                                                                                                                                                                                                              getReaderForRange[K, C](\n  handle: ShuffleHandle,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                              ShuffleReader for a range of reduce partitions to read from map output in the ShuffleHandle

                                                                                                                                                                                                                                                                                                              Used when ShuffledRowRDD (Spark SQL) is requested to compute a partition

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#getting-shufflewriter-for-shufflehandle","title":"Getting ShuffleWriter for ShuffleHandle
                                                                                                                                                                                                                                                                                                              getWriter[K, V](\n  handle: ShuffleHandle,\n  mapId: Long,\n  context: TaskContext,\n  metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]\n

                                                                                                                                                                                                                                                                                                              ShuffleWriter to write shuffle data in the ShuffleHandle

                                                                                                                                                                                                                                                                                                              Used when ShuffleWriteProcessor is requested to write a partition

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#registering-shuffle-of-shuffledependency-and-getting-shufflehandle","title":"Registering Shuffle of ShuffleDependency (and Getting ShuffleHandle)
                                                                                                                                                                                                                                                                                                              registerShuffle[K, V, C](\n  shuffleId: Int,\n  dependency: ShuffleDependency[K, V, C]): ShuffleHandle\n

                                                                                                                                                                                                                                                                                                              Registers a shuffle (by the given shuffleId and ShuffleDependency) and gives a ShuffleHandle

                                                                                                                                                                                                                                                                                                              Used when ShuffleDependency is created (and registers with the shuffle system)
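A hedged sketch of that registration step (simplified): a newly-created ShuffleDependency asks the ShuffleManager for a ShuffleHandle and keeps it for later getWriter/getReader calls. `shuffleId` and `dependency` are assumed to be in scope:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.shuffle.ShuffleHandle

// Register the shuffle and keep the handle for later writer/reader requests.
val handle: ShuffleHandle =
  SparkEnv.get.shuffleManager.registerShuffle(shuffleId, dependency)
```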

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#shuffleblockresolver","title":"ShuffleBlockResolver
                                                                                                                                                                                                                                                                                                              shuffleBlockResolver: ShuffleBlockResolver\n

                                                                                                                                                                                                                                                                                                              ShuffleBlockResolver of the shuffle system

                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                              • SortShuffleManager is requested for a ShuffleWriter for a ShuffleHandle, to unregister a shuffle and stop
                                                                                                                                                                                                                                                                                                              • BlockManager is requested to getLocalBlockData and getHostLocalShuffleData
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#stopping-shufflemanager","title":"Stopping ShuffleManager
                                                                                                                                                                                                                                                                                                              stop(): Unit\n

                                                                                                                                                                                                                                                                                                              Stops the shuffle system

                                                                                                                                                                                                                                                                                                              Used when SparkEnv is requested to stop

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#unregistering-shuffle","title":"Unregistering Shuffle
                                                                                                                                                                                                                                                                                                              unregisterShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                              Unregisters a given shuffle

                                                                                                                                                                                                                                                                                                              Used when BlockManagerSlaveEndpoint is requested to handle a RemoveShuffle message

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                              • SortShuffleManager
                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleManager/#accessing-shufflemanager-using-sparkenv","title":"Accessing ShuffleManager using SparkEnv

                                                                                                                                                                                                                                                                                                              ShuffleManager is available on the driver and executors using SparkEnv.shuffleManager.

                                                                                                                                                                                                                                                                                                              val shuffleManager = SparkEnv.get.shuffleManager\n
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleMapOutputWriter/","title":"ShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                              ShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleReader/","title":"ShuffleReader","text":"

                                                                                                                                                                                                                                                                                                              ShuffleReader is an abstraction of shuffle block readers that can read combined key-value records for a reduce task.

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleReader/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleReader/#reading-combined-records-for-reduce-task","title":"Reading Combined Records (for Reduce Task)
                                                                                                                                                                                                                                                                                                              read(): Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                                              Used when:

• CoGroupedRDD and ShuffledRDD are requested to compute a partition (for a ShuffleDependency)
                                                                                                                                                                                                                                                                                                              • ShuffledRowRDD (Spark SQL) is requested to compute a partition
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleReader/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                              • BlockStoreShuffleReader
                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriteMetricsReporter/","title":"ShuffleWriteMetricsReporter","text":"

                                                                                                                                                                                                                                                                                                              ShuffleWriteMetricsReporter is an abstraction of shuffle write metrics reporters.

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriteMetricsReporter/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#decbyteswritten","title":"decBytesWritten
                                                                                                                                                                                                                                                                                                              decBytesWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#decrecordswritten","title":"decRecordsWritten
                                                                                                                                                                                                                                                                                                              decRecordsWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#incbyteswritten","title":"incBytesWritten
                                                                                                                                                                                                                                                                                                              incBytesWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#increcordswritten","title":"incRecordsWritten
                                                                                                                                                                                                                                                                                                              incRecordsWritten(\n  v: Long): Unit\n

                                                                                                                                                                                                                                                                                                              See ShuffleWriteMetrics

                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                              • ShufflePartitionPairsWriter is requested to recordWritten
                                                                                                                                                                                                                                                                                                              • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                              • DiskBlockObjectWriter is requested to record bytes written
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#incwritetime","title":"incWriteTime
                                                                                                                                                                                                                                                                                                              incWriteTime(\n  v: Long): Unit\n

                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                              • BypassMergeSortShuffleWriter is requested to write partition records and writePartitionedData
                                                                                                                                                                                                                                                                                                              • UnsafeShuffleWriter is requested to mergeSpillsWithTransferTo
                                                                                                                                                                                                                                                                                                              • DiskBlockObjectWriter is requested to commitAndGet
                                                                                                                                                                                                                                                                                                              • TimeTrackingOutputStream is requested to write, flush, and close
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                              • ShuffleWriteMetrics
                                                                                                                                                                                                                                                                                                              • SQLShuffleWriteMetricsReporter (Spark SQL)
                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriteProcessor/","title":"ShuffleWriteProcessor","text":"

ShuffleWriteProcessor controls the write behavior of ShuffleMapTasks when writing partition records out to the shuffle system.

                                                                                                                                                                                                                                                                                                              ShuffleWriteProcessor is used to create a ShuffleDependency.

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriteProcessor/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                              ShuffleWriteProcessor takes no arguments to be created.

                                                                                                                                                                                                                                                                                                              ShuffleWriteProcessor is created when:

                                                                                                                                                                                                                                                                                                              • ShuffleDependency is created
                                                                                                                                                                                                                                                                                                              • ShuffleExchangeExec (Spark SQL) physical operator is requested to createShuffleWriteProcessor
                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriteProcessor/#writing-partition-records-to-shuffle-system","title":"Writing Partition Records to Shuffle System
                                                                                                                                                                                                                                                                                                              write(\n  rdd: RDD[_],\n  dep: ShuffleDependency[_, _, _],\n  mapId: Long,\n  context: TaskContext,\n  partition: Partition): MapStatus\n

                                                                                                                                                                                                                                                                                                              write requests the ShuffleManager for the ShuffleWriter for the ShuffleHandle (of the given ShuffleDependency).

                                                                                                                                                                                                                                                                                                              write requests the ShuffleWriter to write out records (of the given Partition and RDD).

                                                                                                                                                                                                                                                                                                              In the end, write requests the ShuffleWriter to stop (with the success flag enabled).

                                                                                                                                                                                                                                                                                                              In case of any Exceptions, write requests the ShuffleWriter to stop (with the success flag disabled).

write is used when ShuffleMapTask is requested to run.
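A hedged, simplified sketch of the write flow described above (not the exact Spark code); `rdd`, `dep`, `mapId`, `partition` and `context` are assumed inputs:

```scala
// Get a writer, write the partition's records, and stop the writer with the
// success flag reflecting whether writing completed without exceptions.
var writer: ShuffleWriter[Any, Any] = null
try {
  writer = SparkEnv.get.shuffleManager.getWriter[Any, Any](
    dep.shuffleHandle, mapId, context, createMetricsReporter(context))
  writer.write(
    rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  writer.stop(success = true).get   // MapStatus when writing succeeded
} catch {
  case e: Exception =>
    if (writer != null) {
      writer.stop(success = false)  // failed: no MapStatus
    }
    throw e
}
```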

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriteProcessor/#creating-metricsreporter","title":"Creating MetricsReporter
                                                                                                                                                                                                                                                                                                              createMetricsReporter(\n  context: TaskContext): ShuffleWriteMetricsReporter\n

                                                                                                                                                                                                                                                                                                              createMetricsReporter creates a ShuffleWriteMetricsReporter from the given TaskContext.

                                                                                                                                                                                                                                                                                                              createMetricsReporter requests the given TaskContext for TaskMetrics and then for the ShuffleWriteMetrics.
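A minimal sketch of that lookup (it mirrors internal code, so the types involved are private[spark] and not callable from user code):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.shuffle.ShuffleWriteMetricsReporter

// The reporter is simply the ShuffleWriteMetrics of the task's TaskMetrics.
def createMetricsReporter(context: TaskContext): ShuffleWriteMetricsReporter =
  context.taskMetrics().shuffleWriteMetrics
```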

                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriter/","title":"ShuffleWriter","text":"

ShuffleWriter[K, V] (of K keys and V values) is an abstraction of shuffle writers that can write out key-value records (of an RDD partition) to a shuffle system.

                                                                                                                                                                                                                                                                                                              ShuffleWriter is used when ShuffleMapTask is requested to run (and uses a ShuffleWriteProcessor to write partition records to a shuffle system).

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/ShuffleWriter/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleWriter/#writing-out-partition-records-to-shuffle-system","title":"Writing Out Partition Records to Shuffle System
                                                                                                                                                                                                                                                                                                              write(\n  records: Iterator[Product2[K, V]]): Unit\n

                                                                                                                                                                                                                                                                                                              Writes key-value records (of a partition) out to a shuffle system

                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                              • ShuffleWriteProcessor is requested to write
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriter/#stopping-shufflewriter","title":"Stopping ShuffleWriter
                                                                                                                                                                                                                                                                                                              stop(\n  success: Boolean): Option[MapStatus]\n

                                                                                                                                                                                                                                                                                                              Stops (closes) the ShuffleWriter and returns a MapStatus if the writing completed successfully. The success flag is the status of the task execution.

                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                              • ShuffleWriteProcessor is requested to write
                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/ShuffleWriter/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                              • BypassMergeSortShuffleWriter
                                                                                                                                                                                                                                                                                                              • SortShuffleWriter
                                                                                                                                                                                                                                                                                                              • UnsafeShuffleWriter
                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/SingleSpillShuffleMapOutputWriter/","title":"SingleSpillShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                              SingleSpillShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/SortShuffleManager/","title":"SortShuffleManager","text":"

                                                                                                                                                                                                                                                                                                              SortShuffleManager is the default and only ShuffleManager in Apache Spark (with the short name sort or tungsten-sort).

                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/SortShuffleManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                              SortShuffleManager takes the following to be created:

                                                                                                                                                                                                                                                                                                              • SparkConf

                                                                                                                                                                                                                                                                                                                SortShuffleManager is created when SparkEnv is created (on the driver and executors at the very beginning of a Spark application's lifecycle).

                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/SortShuffleManager/#taskidmapsforshuffle-registry","title":"taskIdMapsForShuffle Registry
                                                                                                                                                                                                                                                                                                                taskIdMapsForShuffle: ConcurrentHashMap[Int, OpenHashSet[Long]]\n

                                                                                                                                                                                                                                                                                                                SortShuffleManager uses taskIdMapsForShuffle internal registry to track task (attempt) IDs by shuffle.

A new shuffle ID and task (attempt) IDs are added when SortShuffleManager is requested for a ShuffleWriter (for a partition and a ShuffleHandle).

                                                                                                                                                                                                                                                                                                                A shuffle ID (and associated task IDs) are removed when SortShuffleManager is requested to unregister a shuffle.
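
A minimal, self-contained sketch of this bookkeeping (plain Scala collections stand in for Spark's OpenHashSet; the names are hypothetical):

import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// Track task (attempt) IDs by shuffle ID, mirroring the add/remove behaviour above.
val taskIdsByShuffle = new ConcurrentHashMap[Int, mutable.Set[Long]]()

def recordTask(shuffleId: Int, taskAttemptId: Long): Unit = {
  val ids = taskIdsByShuffle.computeIfAbsent(shuffleId, _ => mutable.Set.empty[Long])
  ids += taskAttemptId
}

def unregister(shuffleId: Int): Option[mutable.Set[Long]] =
  Option(taskIdsByShuffle.remove(shuffleId))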

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#getting-shufflewriter-for-partition-and-shufflehandle","title":"Getting ShuffleWriter for Partition and ShuffleHandle
                                                                                                                                                                                                                                                                                                                getWriter[K, V](\n  handle: ShuffleHandle,\n  mapId: Int,\n  context: TaskContext): ShuffleWriter[K, V]\n

                                                                                                                                                                                                                                                                                                                getWriter registers the given ShuffleHandle (by the shuffleId and numMaps) in the taskIdMapsForShuffle internal registry unless already done.

                                                                                                                                                                                                                                                                                                                Note

getWriter expects the input ShuffleHandle to be a BaseShuffleHandle. Moreover, in two (out of three) cases getWriter expects it to be one of the more specialized subtypes (SerializedShuffleHandle or BypassMergeSortShuffleHandle).

                                                                                                                                                                                                                                                                                                                getWriter then creates a new ShuffleWriter based on the type of the given ShuffleHandle.

ShuffleHandle | ShuffleWriter
SerializedShuffleHandle | UnsafeShuffleWriter
BypassMergeSortShuffleHandle | BypassMergeSortShuffleWriter
BaseShuffleHandle | SortShuffleWriter

                                                                                                                                                                                                                                                                                                                getWriter is part of the ShuffleManager abstraction.
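
That type-based dispatch can be pictured with a self-contained toy (simplified stand-in types, not Spark's classes):

sealed trait ToyHandle
case object SerializedHandle extends ToyHandle
case object BypassMergeSortHandle extends ToyHandle
case object BaseHandle extends ToyHandle

// Pick a writer name based on the (toy) handle subtype, mirroring the table above.
def writerFor(handle: ToyHandle): String = handle match {
  case SerializedHandle      => "UnsafeShuffleWriter"
  case BypassMergeSortHandle => "BypassMergeSortShuffleWriter"
  case BaseHandle            => "SortShuffleWriter"
}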

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                shuffleExecutorComponents: ShuffleExecutorComponents\n

                                                                                                                                                                                                                                                                                                                SortShuffleManager defines the shuffleExecutorComponents internal registry for a ShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                                shuffleExecutorComponents\u00a0is used when:

                                                                                                                                                                                                                                                                                                                • SortShuffleManager is requested for the ShuffleWriter
                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#loadshuffleexecutorcomponents","title":"loadShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                loadShuffleExecutorComponents(\n  conf: SparkConf): ShuffleExecutorComponents\n

                                                                                                                                                                                                                                                                                                                loadShuffleExecutorComponents loads the ShuffleDataIO that is then requested for the ShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                                loadShuffleExecutorComponents requests the ShuffleExecutorComponents to initialize before returning it.
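
Which ShuffleDataIO gets loaded is driven by a configuration property; a sketch (assuming the spark.shuffle.sort.io.plugin.class property and the default local-disk implementation):

import org.apache.spark.SparkConf

// Assumption: spark.shuffle.sort.io.plugin.class selects the ShuffleDataIO implementation;
// LocalDiskShuffleDataIO is the default (local-disk) one.
val conf = new SparkConf()
  .set("spark.shuffle.sort.io.plugin.class",
    "org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO")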

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#creating-shufflehandle-for-shuffledependency","title":"Creating ShuffleHandle for ShuffleDependency
                                                                                                                                                                                                                                                                                                                registerShuffle[K, V, C](\n  shuffleId: Int,\n  dependency: ShuffleDependency[K, V, C]): ShuffleHandle\n

                                                                                                                                                                                                                                                                                                                registerShuffle\u00a0is part of the ShuffleManager abstraction.

registerShuffle creates a new ShuffleHandle (for the given ShuffleDependency) that is one of the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                1. BypassMergeSortShuffleHandle (with ShuffleDependency[K, V, V]) when shouldBypassMergeSort condition holds

                                                                                                                                                                                                                                                                                                                2. SerializedShuffleHandle (with ShuffleDependency[K, V, V]) when canUseSerializedShuffle condition holds

                                                                                                                                                                                                                                                                                                                3. BaseShuffleHandle
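
A toy illustration of that selection order (hypothetical names; the real checks are shouldBypassMergeSort and canUseSerializedShuffle, described in the following sections):

// The bypass check wins over the serialized one; BaseShuffleHandle is the fallback.
def chooseHandle(shouldBypassMergeSort: Boolean, canUseSerializedShuffle: Boolean): String =
  if (shouldBypassMergeSort) "BypassMergeSortShuffleHandle"
  else if (canUseSerializedShuffle) "SerializedShuffleHandle"
  else "BaseShuffleHandle"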

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#serializedshufflehandle-requirements","title":"SerializedShuffleHandle Requirements
                                                                                                                                                                                                                                                                                                                canUseSerializedShuffle(\n  dependency: ShuffleDependency[_, _, _]): Boolean\n

                                                                                                                                                                                                                                                                                                                canUseSerializedShuffle is true when all of the following hold for the given ShuffleDependency:

                                                                                                                                                                                                                                                                                                                1. Serializer (of the given ShuffleDependency) supports relocation of serialized objects

                                                                                                                                                                                                                                                                                                                2. mapSideCombine flag (of the given ShuffleDependency) is false

                                                                                                                                                                                                                                                                                                                3. Number of partitions (of the Partitioner of the given ShuffleDependency) is not greater than the supported maximum number

When all of the above requirements hold, canUseSerializedShuffle prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                Can use serialized shuffle for shuffle [shuffleId]\n

                                                                                                                                                                                                                                                                                                                Otherwise, canUseSerializedShuffle is false and prints out one of the following DEBUG messages based on the failed requirement:

                                                                                                                                                                                                                                                                                                                Can't use serialized shuffle for shuffle [id] because the serializer, [name], does not support object relocation\n
                                                                                                                                                                                                                                                                                                                Can't use serialized shuffle for shuffle [id] because we need to do map-side aggregation\n
                                                                                                                                                                                                                                                                                                                Can't use serialized shuffle for shuffle [id] because it has more than [number] partitions\n

                                                                                                                                                                                                                                                                                                                canUseSerializedShuffle\u00a0is used when:

                                                                                                                                                                                                                                                                                                                • SortShuffleManager is requested to register a shuffle (and creates a ShuffleHandle)
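
For example, a shuffle introduced by partitionBy with Kryo serialization satisfies all three requirements (no map-side combine, a relocation-friendly serializer with default Kryo settings, and far fewer partitions than the maximum); a runnable local-mode sketch (1000 partitions also keeps it above the bypass-merge-sort threshold, so this path is actually taken):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// partitionBy does no map-side combine and Kryo (with default settings) supports
// relocation of serialized objects; 1000 partitions is above the bypass-merge
// threshold (200 by default) and far below the 16777216 limit.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("serialized-shuffle-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
sc.parallelize(1 to 100000).map(n => (n, n)).partitionBy(new HashPartitioner(1000)).count()
sc.stop()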
                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#maximum-number-of-partition-identifiers-for-serialized-mode","title":"Maximum Number of Partition Identifiers for Serialized Mode

                                                                                                                                                                                                                                                                                                                SortShuffleManager defines MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE internal constant to be (1 << 24) (16777216) for the maximum number of shuffle output partitions.

                                                                                                                                                                                                                                                                                                                MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE is used when:

                                                                                                                                                                                                                                                                                                                • UnsafeShuffleWriter is created
                                                                                                                                                                                                                                                                                                                • SortShuffleManager utility is used to check out SerializedShuffleHandle requirements
                                                                                                                                                                                                                                                                                                                • ShuffleExchangeExec (Spark SQL) utility is used to needToCopyObjectsBeforeShuffle
                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#creating-shuffleblockresolver","title":"Creating ShuffleBlockResolver
                                                                                                                                                                                                                                                                                                                shuffleBlockResolver: IndexShuffleBlockResolver\n

                                                                                                                                                                                                                                                                                                                shuffleBlockResolver\u00a0is part of the ShuffleManager abstraction.

shuffleBlockResolver is an IndexShuffleBlockResolver (and is created immediately alongside this SortShuffleManager).

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#unregistering-shuffle","title":"Unregistering Shuffle
                                                                                                                                                                                                                                                                                                                unregisterShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                unregisterShuffle\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                unregisterShuffle removes the given shuffleId from the taskIdMapsForShuffle internal registry.

                                                                                                                                                                                                                                                                                                                If the shuffleId was found and removed successfully, unregisterShuffle requests the IndexShuffleBlockResolver to remove the shuffle index and data files for every mapTaskId (mappers producing the output for the shuffle).

unregisterShuffle always returns true.

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#getting-shufflereader-for-shufflehandle","title":"Getting ShuffleReader for ShuffleHandle
                                                                                                                                                                                                                                                                                                                getReader[K, C](\n  handle: ShuffleHandle,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                                getReader\u00a0is part of the ShuffleManager abstraction.

getReader requests the MapOutputTracker (via SparkEnv) for the map output sizes by executor ID (getMapSizesByExecutorId) for the shuffleId (of the given ShuffleHandle) and the given map and reduce partition ranges.

                                                                                                                                                                                                                                                                                                                In the end, getReader creates a new BlockStoreShuffleReader.

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#stopping-shufflemanager","title":"Stopping ShuffleManager
                                                                                                                                                                                                                                                                                                                stop(): Unit\n

                                                                                                                                                                                                                                                                                                                stop\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                stop requests the IndexShuffleBlockResolver to stop.

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.shuffle.sort.SortShuffleManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.shuffle.sort.SortShuffleManager=ALL\n

                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/SortShuffleWriter/","title":"SortShuffleWriter \u2014 Fallback ShuffleWriter","text":"

                                                                                                                                                                                                                                                                                                                SortShuffleWriter is a \"fallback\" ShuffleWriter (when SortShuffleManager is requested for a ShuffleWriter and the more specialized BypassMergeSortShuffleWriter and UnsafeShuffleWriter could not be used).

                                                                                                                                                                                                                                                                                                                SortShuffleWriter[K, V, C] is a parameterized type with K keys, V values, and C combiner values.
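
For example, a reduceByKey shuffle (which does map-side combine) cannot use either of the specialized writers and therefore falls back to SortShuffleWriter; a runnable local-mode sketch:

import org.apache.spark.{SparkConf, SparkContext}

// reduceByKey uses map-side combine, which rules out both BypassMergeSortShuffleWriter
// and UnsafeShuffleWriter, so the shuffle falls back to SortShuffleWriter.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("sort-shuffle-writer-demo"))
sc.parallelize(1 to 1000).map(n => (n % 10, n)).reduceByKey(_ + _).count()
sc.stop()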

                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/SortShuffleWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                SortShuffleWriter takes the following to be created:

                                                                                                                                                                                                                                                                                                                • IndexShuffleBlockResolver (unused)
                                                                                                                                                                                                                                                                                                                • BaseShuffleHandle
                                                                                                                                                                                                                                                                                                                • Map ID
                                                                                                                                                                                                                                                                                                                • TaskContext
                                                                                                                                                                                                                                                                                                                • ShuffleExecutorComponents

                                                                                                                                                                                                                                                                                                                  SortShuffleWriter is created\u00a0when:

                                                                                                                                                                                                                                                                                                                  • SortShuffleManager is requested for a ShuffleWriter (for a given ShuffleHandle)
                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/SortShuffleWriter/#mapstatus","title":"MapStatus

                                                                                                                                                                                                                                                                                                                  SortShuffleWriter uses mapStatus internal registry for a MapStatus after writing records.

                                                                                                                                                                                                                                                                                                                  Writing records itself does not return a value and SortShuffleWriter uses the registry when requested to stop (which allows returning a MapStatus).

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/SortShuffleWriter/#writing-records-into-shuffle-partitioned-file-in-disk-store","title":"Writing Records (Into Shuffle Partitioned File In Disk Store)
                                                                                                                                                                                                                                                                                                                  write(\n  records: Iterator[Product2[K, V]]): Unit\n

                                                                                                                                                                                                                                                                                                                  write is part of the ShuffleWriter abstraction.

write creates an ExternalSorter based on the ShuffleDependency (of the BaseShuffleHandle), namely the Map-Side Combine (partial aggregation) flag. The ExternalSorter uses the aggregator and key ordering when the flag is enabled.

                                                                                                                                                                                                                                                                                                                  write requests the ExternalSorter to insert all the given records.

write then requests the ShuffleExecutorComponents for a ShuffleMapOutputWriter (for the shuffle and map IDs) and the ExternalSorter to write the partitioned map output to it. In the end, write commits all partitions and records the resulting MapStatus in the mapStatus internal registry.

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/SortShuffleWriter/#stopping-sortshufflewriter-and-calculating-mapstatus","title":"Stopping SortShuffleWriter (and Calculating MapStatus)
                                                                                                                                                                                                                                                                                                                  stop(\n  success: Boolean): Option[MapStatus]\n

                                                                                                                                                                                                                                                                                                                  stop is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                  stop turns the stopping flag on and returns the internal mapStatus if the input success is enabled.

Otherwise, when the stopping flag is already enabled or the input success is disabled, stop returns no MapStatus (i.e. None).

                                                                                                                                                                                                                                                                                                                  In the end, stop requests the ExternalSorter to stop and increments the shuffle write time task metrics.

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/SortShuffleWriter/#requirements-of-bypassmergesortshufflehandle-as-shufflehandle","title":"Requirements of BypassMergeSortShuffleHandle (as ShuffleHandle)
                                                                                                                                                                                                                                                                                                                  shouldBypassMergeSort(\n  conf: SparkConf,\n  dep: ShuffleDependency[_, _, _]): Boolean\n

                                                                                                                                                                                                                                                                                                                  shouldBypassMergeSort returns true when all of the following hold:

                                                                                                                                                                                                                                                                                                                  1. No map-side aggregation (the mapSideCombine flag of the given ShuffleDependency is off)

                                                                                                                                                                                                                                                                                                                  2. Number of partitions (of the Partitioner of the given ShuffleDependency) is not greater than spark.shuffle.sort.bypassMergeThreshold configuration property

                                                                                                                                                                                                                                                                                                                  Otherwise, shouldBypassMergeSort does not hold (false).

                                                                                                                                                                                                                                                                                                                  shouldBypassMergeSort is used when:

                                                                                                                                                                                                                                                                                                                  • SortShuffleManager is requested to register a shuffle (and creates a ShuffleHandle)
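
For example, a groupByKey shuffle with only a handful of partitions meets both conditions; a runnable local-mode sketch:

import org.apache.spark.{SparkConf, SparkContext}

// groupByKey does no map-side combine and 10 partitions is below the
// spark.shuffle.sort.bypassMergeThreshold default of 200, so both conditions hold.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("bypass-merge-sort-demo"))
sc.parallelize(1 to 1000).map(n => (n % 10, n)).groupByKey(numPartitions = 10).count()
sc.stop()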
                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/SortShuffleWriter/#stopping-flag","title":"stopping Flag

                                                                                                                                                                                                                                                                                                                  SortShuffleWriter uses stopping internal flag to indicate whether or not this SortShuffleWriter has been stopped.

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/SortShuffleWriter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.shuffle.sort.SortShuffleWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.shuffle.sort.SortShuffleWriter=ALL\n

                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/Spillable/","title":"Spillable","text":"

                                                                                                                                                                                                                                                                                                                  Spillable is an extension of the MemoryConsumer abstraction for spillable collections that can spill to disk.

                                                                                                                                                                                                                                                                                                                  Spillable[C] is a parameterized type of C combiner (partial) values.

                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/Spillable/#contract","title":"Contract","text":""},{"location":"shuffle/Spillable/#forcespill","title":"forceSpill
                                                                                                                                                                                                                                                                                                                  forceSpill(): Boolean\n

                                                                                                                                                                                                                                                                                                                  Force spilling the current in-memory collection to disk to release memory.

                                                                                                                                                                                                                                                                                                                  Used when Spillable is requested to spill

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/Spillable/#spill","title":"spill
                                                                                                                                                                                                                                                                                                                  spill(\n  collection: C): Unit\n

                                                                                                                                                                                                                                                                                                                  Spills the current in-memory collection to disk, and releases the memory.

                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                  • ExternalAppendOnlyMap is requested to forceSpill
• Spillable is requested to spill to disk if necessary (maybeSpill)
                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/Spillable/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                  • ExternalAppendOnlyMap
                                                                                                                                                                                                                                                                                                                  • ExternalSorter
                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/Spillable/#memory-threshold","title":"Memory Threshold

                                                                                                                                                                                                                                                                                                                  Spillable uses a threshold for the memory size (in bytes) to know when to spill to disk.

                                                                                                                                                                                                                                                                                                                  When the size of the in-memory collection is above the threshold, Spillable will try to acquire more memory. Unless given all requested memory, Spillable spills to disk.

The memory threshold starts at spark.shuffle.spill.initialMemoryThreshold configuration property and is increased every time Spillable is requested to spill to disk if needed (maybeSpill) but manages to acquire the required memory (so no spill occurs). The threshold goes back to the initial value when requested to release all memory.

                                                                                                                                                                                                                                                                                                                  Used when Spillable is requested to spill and releaseMemory.
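
The behaviour above can be summarized with a small self-contained sketch (hypothetical names, not Spark's Spillable; acquire stands in for the memory manager and returns the number of bytes actually granted):

final class SpillThreshold(initial: Long) {
  private var threshold = initial

  // Returns true when the caller should spill the in-memory collection to disk.
  def maybeSpill(collectionSize: Long, acquire: Long => Long): Boolean = {
    if (collectionSize < threshold) return false
    val granted = acquire(2 * collectionSize - threshold) // request enough to double the tracked size
    threshold += granted
    collectionSize >= threshold // spill unless given all of the requested memory
  }

  // "The threshold goes back to the initial value when requested to release all memory."
  def release(): Unit = threshold = initial
}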

                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/Spillable/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                  Spillable takes the following to be created:

• TaskMemoryManager

Abstract Class

                                                                                                                                                                                                                                                                                                                    Spillable is an abstract class and cannot be created directly. It is created indirectly for the concrete Spillables.

                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#configuration-properties","title":"Configuration Properties","text":""},{"location":"shuffle/Spillable/#sparkshufflespillnumelementsforcespillthreshold","title":"spark.shuffle.spill.numElementsForceSpillThreshold

                                                                                                                                                                                                                                                                                                                    Spillable uses spark.shuffle.spill.numElementsForceSpillThreshold configuration property to force spilling in-memory objects to disk when requested to maybeSpill.

                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#sparkshufflespillinitialmemorythreshold","title":"spark.shuffle.spill.initialMemoryThreshold

                                                                                                                                                                                                                                                                                                                    Spillable uses spark.shuffle.spill.initialMemoryThreshold configuration property as the initial threshold for the size of a collection (and the minimum memory required to operate properly).

                                                                                                                                                                                                                                                                                                                    Spillable uses it when requested to spill and releaseMemory.
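For reference, both spill-related properties can be set on a SparkConf before the Spark application starts; the values below are illustrative only, not recommendations.

```scala
import org.apache.spark.SparkConf

// Illustrative values only
val conf = new SparkConf()
  .set("spark.shuffle.spill.initialMemoryThreshold", (1L * 1024 * 1024).toString) // bytes
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "1000000")           // records
```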

                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#releasing-all-memory","title":"Releasing All Memory
                                                                                                                                                                                                                                                                                                                    releaseMemory(): Unit\n

                                                                                                                                                                                                                                                                                                                    releaseMemory...FIXME

                                                                                                                                                                                                                                                                                                                    releaseMemory is used when:

                                                                                                                                                                                                                                                                                                                    • ExternalAppendOnlyMap is requested to freeCurrentMap
                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to stop
• Spillable is requested to maybeSpill and spill (when the collection was actually spilled to disk in either case)
                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#spilling-in-memory-collection-to-disk-to-release-memory","title":"Spilling In-Memory Collection to Disk (to Release Memory)
                                                                                                                                                                                                                                                                                                                    spill(\n  collection: C): Unit\n

                                                                                                                                                                                                                                                                                                                    spill spills the given in-memory collection to disk to release memory.

                                                                                                                                                                                                                                                                                                                    spill is used when:

                                                                                                                                                                                                                                                                                                                    • ExternalAppendOnlyMap is requested to forceSpill
                                                                                                                                                                                                                                                                                                                    • Spillable is requested to maybeSpill
                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#forcespill_1","title":"forceSpill
                                                                                                                                                                                                                                                                                                                    forceSpill(): Boolean\n

                                                                                                                                                                                                                                                                                                                    forceSpill forcefully spills the Spillable to disk to release memory.

                                                                                                                                                                                                                                                                                                                    forceSpill is used when Spillable is requested to spill an in-memory collection to disk.

                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/Spillable/#spilling-to-disk-if-necessary","title":"Spilling to Disk if Necessary
                                                                                                                                                                                                                                                                                                                    maybeSpill(\n  collection: C,\n  currentMemory: Long): Boolean\n

                                                                                                                                                                                                                                                                                                                    maybeSpill...FIXME

                                                                                                                                                                                                                                                                                                                    maybeSpill is used when:

                                                                                                                                                                                                                                                                                                                    • ExternalAppendOnlyMap is requested to insertAll
                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to attempt to spill an in-memory collection to disk if needed
                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/UnsafeShuffleWriter/","title":"UnsafeShuffleWriter","text":"

                                                                                                                                                                                                                                                                                                                    UnsafeShuffleWriter<K, V> is a ShuffleWriter for SerializedShuffleHandles.

                                                                                                                                                                                                                                                                                                                    UnsafeShuffleWriter opens resources (a ShuffleExternalSorter and the buffers) while being created.

                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/UnsafeShuffleWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                    UnsafeShuffleWriter takes the following to be created:

                                                                                                                                                                                                                                                                                                                    • BlockManager
                                                                                                                                                                                                                                                                                                                    • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                    • SerializedShuffleHandle
                                                                                                                                                                                                                                                                                                                    • Map ID
                                                                                                                                                                                                                                                                                                                    • TaskContext
                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                    • ShuffleWriteMetricsReporter
                                                                                                                                                                                                                                                                                                                    • ShuffleExecutorComponents

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter is created when SortShuffleManager is requested for a ShuffleWriter for a SerializedShuffleHandle.

UnsafeShuffleWriter makes sure that there are at most (1 << 24) reduce partitions (the upper bound of the partition identifiers that can be encoded) or throws an IllegalArgumentException:

UnsafeShuffleWriter can only be used for shuffles with at most 16777215 reduce partitions

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses the number of partitions of the Partitioner that is used for the ShuffleDependency of the SerializedShuffleHandle.

                                                                                                                                                                                                                                                                                                                      Note

The number of shuffle output partitions is first enforced when SortShuffleManager is requested to check whether a SerializedShuffleHandle can be used for a ShuffleHandle (that eventually leads to UnsafeShuffleWriter).

                                                                                                                                                                                                                                                                                                                      In the end, UnsafeShuffleWriter creates a ShuffleExternalSorter and a SerializationStream.
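The partition-count guard above can be sketched as follows; the constant and method names are illustrative, not Spark's internals.

```scala
// Hedged sketch of the reduce-partition guard (illustrative only).
object PartitionCountGuardSketch {
  // Upper bound of the partition identifiers that can be encoded
  val MaxShuffleOutputPartitions: Int = 1 << 24

  def requireSupportedPartitionCount(numPartitions: Int): Unit =
    require(
      numPartitions <= MaxShuffleOutputPartitions,
      s"UnsafeShuffleWriter can only be used for shuffles with at most " +
        s"${MaxShuffleOutputPartitions - 1} reduce partitions")

  def main(args: Array[String]): Unit = {
    requireSupportedPartitionCount(200)     // fine
    requireSupportedPartitionCount(1 << 25) // throws IllegalArgumentException
  }
}
```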

                                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/UnsafeShuffleWriter/#shuffleexternalsorter","title":"ShuffleExternalSorter

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses a ShuffleExternalSorter.

                                                                                                                                                                                                                                                                                                                      ShuffleExternalSorter is created when UnsafeShuffleWriter is requested to open (while being created) and dereferenced (nulled) when requested to close internal resources and merge spill files.

                                                                                                                                                                                                                                                                                                                      Used when UnsafeShuffleWriter is requested for the following:

                                                                                                                                                                                                                                                                                                                      • Updating peak memory used
                                                                                                                                                                                                                                                                                                                      • Writing records
                                                                                                                                                                                                                                                                                                                      • Closing internal resources and merging spill files
                                                                                                                                                                                                                                                                                                                      • Inserting a record
                                                                                                                                                                                                                                                                                                                      • Stopping
                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#indexshuffleblockresolver","title":"IndexShuffleBlockResolver

UnsafeShuffleWriter is given an IndexShuffleBlockResolver when created.

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses the IndexShuffleBlockResolver for...FIXME

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#initial-serialized-buffer-size","title":"Initial Serialized Buffer Size

UnsafeShuffleWriter uses a fixed buffer size of 1024 * 1024 bytes (1 MB) for the output stream of serialized data written into a byte array.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#inputbuffersizeinbytes","title":"inputBufferSizeInBytes

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses the spark.shuffle.file.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#outputbuffersizeinbytes","title":"outputBufferSizeInBytes

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses the spark.shuffle.unsafe.file.output.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#transfertoenabled","title":"transferToEnabled

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter can use a specialized NIO-based fast merge procedure that avoids extra serialization/deserialization when spark.file.transferTo configuration property is enabled.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#initialsortbuffersize","title":"initialSortBufferSize

                                                                                                                                                                                                                                                                                                                      UnsafeShuffleWriter uses the initial buffer size for sorting (default: 4096) when creating a ShuffleExternalSorter (when requested to open).

                                                                                                                                                                                                                                                                                                                      Tip

                                                                                                                                                                                                                                                                                                                      Use spark.shuffle.sort.initialBufferSize configuration property to change the buffer size.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#merging-spills","title":"Merging Spills
                                                                                                                                                                                                                                                                                                                      long[] mergeSpills(\n  SpillInfo[] spills,\n  File outputFile)\n
                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#many-spills","title":"Many Spills

                                                                                                                                                                                                                                                                                                                      With multiple SpillInfos to merge, mergeSpills selects between fast and slow merge strategies. The fast merge strategy can be transferTo- or fileStream-based.

                                                                                                                                                                                                                                                                                                                      mergeSpills uses the spark.shuffle.unsafe.fastMergeEnabled configuration property to consider one of the fast merge strategies.

                                                                                                                                                                                                                                                                                                                      A fast merge strategy is supported when spark.shuffle.compress configuration property is disabled or the IO compression codec supports decompression of concatenated compressed streams.

With spark.shuffle.unsafe.fastMergeEnabled disabled, or with spark.shuffle.compress enabled and a compression codec that does not support concatenated compressed streams, mergeSpills falls back to the slow merge strategy.

With the fast merge strategy enabled and supported, transferToEnabled enabled and encryption disabled, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithTransferTo.

Using transferTo-based fast merge

With the fast merge strategy enabled and supported, but transferToEnabled disabled or encryption enabled, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithFileStream (with no compression codec).

Using fileStream-based fast merge

For the slow merge, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithFileStream (with the compression codec).

Using slow merge

                                                                                                                                                                                                                                                                                                                      In the end, mergeSpills requests the ShuffleWriteMetrics to decBytesWritten and incBytesWritten, and returns the partition length array.
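The strategy selection above boils down to a small decision. The sketch below is an assumption-level paraphrase (a pure function over the relevant flags), not Spark's actual code; the returned strings mirror the DEBUG messages quoted above.

```scala
// Hedged sketch of mergeSpills' strategy selection (illustrative only).
object MergeStrategySketch {
  def chooseMergeStrategy(
      fastMergeEnabled: Boolean,            // spark.shuffle.unsafe.fastMergeEnabled
      compressionEnabled: Boolean,          // spark.shuffle.compress
      codecSupportsConcatenation: Boolean,  // codec can decompress concatenated streams
      transferToEnabled: Boolean,           // spark.file.transferTo
      encryptionEnabled: Boolean): String = {
    val fastMergeSupported = !compressionEnabled || codecSupportsConcatenation
    if (fastMergeEnabled && fastMergeSupported) {
      if (transferToEnabled && !encryptionEnabled) "Using transferTo-based fast merge"
      else "Using fileStream-based fast merge"
    } else {
      "Using slow merge"
    }
  }

  def main(args: Array[String]): Unit =
    println(chooseMergeStrategy(
      fastMergeEnabled = true, compressionEnabled = true,
      codecSupportsConcatenation = true, transferToEnabled = true, encryptionEnabled = false))
}
```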

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#one-spill","title":"One Spill

                                                                                                                                                                                                                                                                                                                      With one SpillInfo to merge, mergeSpills simply renames the spill file to be the output file and returns the partition length array of the one spill.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#no-spills","title":"No Spills

With no SpillInfos to merge, mergeSpills creates an empty output file and returns an array of zeros with as many elements as the number of partitions of the Partitioner.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#usage","title":"Usage

                                                                                                                                                                                                                                                                                                                      mergeSpills is used when UnsafeShuffleWriter is requested to close internal resources and merge spill files.

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#mergespillswithtransferto","title":"mergeSpillsWithTransferTo
                                                                                                                                                                                                                                                                                                                      long[] mergeSpillsWithTransferTo(\n  SpillInfo[] spills,\n  File outputFile)\n

                                                                                                                                                                                                                                                                                                                      mergeSpillsWithTransferTo...FIXME

                                                                                                                                                                                                                                                                                                                      mergeSpillsWithTransferTo is used when UnsafeShuffleWriter is requested to mergeSpills (with the transferToEnabled flag enabled and no encryption).

== updatePeakMemoryUsed Internal Method

void updatePeakMemoryUsed()

                                                                                                                                                                                                                                                                                                                      updatePeakMemoryUsed...FIXME

updatePeakMemoryUsed is used when UnsafeShuffleWriter is requested for the peak memory used and to close internal resources and merge spill files.

== Writing Key-Value Records of Partition

void write(
  Iterator<Product2<K, V>> records)

write traverses the input sequence of records (of an RDD partition) and inserts them into the sorter (insertRecordIntoSorter) one by one. When all the records have been processed, write closes internal resources and merges spill files.

                                                                                                                                                                                                                                                                                                                      In the end, write requests ShuffleExternalSorter to clean up.

                                                                                                                                                                                                                                                                                                                      CAUTION: FIXME

When requested to write records, UnsafeShuffleWriter simply inserts every record into the sorter and then closes internal resources and merges spill files (which, among other things, creates the MapStatus).

                                                                                                                                                                                                                                                                                                                      write is part of the ShuffleWriter abstraction.
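The write control flow can be sketched as below. Everything here is a stub-based illustration (println stand-ins for the sorter and the helpers), not the actual Java implementation.

```scala
// Hedged sketch of UnsafeShuffleWriter.write's control flow (illustrative only).
object WriteFlowSketch {
  def insertRecordIntoSorter(record: (Any, Any)): Unit = println(s"insert $record into the sorter")
  def closeAndWriteOutput(): Unit = println("close internal resources and merge spill files")
  def cleanupSorterResources(): Unit = println("request ShuffleExternalSorter to clean up")

  def write(records: Iterator[(Any, Any)]): Unit = {
    var success = false
    try {
      records.foreach(insertRecordIntoSorter) // one record at a time
      closeAndWriteOutput()                   // also creates the MapStatus
      success = true
    } finally {
      // In the end, the sorter is asked to clean up (here only on failure, as an approximation)
      if (!success) cleanupSorterResources()
    }
  }

  def main(args: Array[String]): Unit =
    write(Iterator(("k1", "v1"), ("k2", "v2")))
}
```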

== Stopping ShuffleWriter

Option<MapStatus> stop(
  boolean success)

                                                                                                                                                                                                                                                                                                                      stop...FIXME

When requested to stop, UnsafeShuffleWriter records the peak execution memory metric and returns the MapStatus (that was created when requested to close internal resources and merge spill files).

                                                                                                                                                                                                                                                                                                                      stop is part of the ShuffleWriter abstraction.

== Inserting Record Into ShuffleExternalSorter

void insertRecordIntoSorter(
  Product2<K, V> record)

insertRecordIntoSorter requires that the ShuffleExternalSorter is available.

insertRecordIntoSorter requests the serialization buffer to reset (so that all currently accumulated output in the output stream is discarded while the already-allocated buffer space is reused).

insertRecordIntoSorter requests the SerializationStream to write out the record (write the key and the value) and to flush.

insertRecordIntoSorter then requests the serialization buffer for its length (the size of the serialized record).

insertRecordIntoSorter requests the Partitioner for the partition of the given record (by its key).

In the end, insertRecordIntoSorter requests the ShuffleExternalSorter to insert the serialized record as a byte array (with the serialized record size and the partition ID), as sketched below.
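Put together, the per-record steps above look roughly like this self-contained sketch. Java serialization, a hash-based partitioner and a println sorter stand in for Spark's SerializationStream, Partitioner and ShuffleExternalSorter; none of these names belong to Spark's API.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hedged sketch of insertRecordIntoSorter's per-record steps (illustrative only).
object InsertRecordSketch {
  val numPartitions = 4
  val serBuffer = new ByteArrayOutputStream(1024 * 1024)

  // Stand-in for ShuffleExternalSorter.insertRecord
  def insertRecord(bytes: Array[Byte], length: Int, partitionId: Int): Unit =
    println(s"inserting $length bytes for partition $partitionId")

  def insertRecordIntoSorter(record: (AnyRef, AnyRef)): Unit = {
    serBuffer.reset()                           // discard accumulated output, reuse the buffer space
    val out = new ObjectOutputStream(serBuffer) // stand-in for the SerializationStream
    out.writeObject(record._1)                  // write the key ...
    out.writeObject(record._2)                  // ... and the value
    out.flush()
    val serializedRecordSize = serBuffer.size() // length of the serialized record
    val partitionId = math.abs(record._1.hashCode()) % numPartitions // Partitioner.getPartition(key)
    insertRecord(serBuffer.toByteArray, serializedRecordSize, partitionId)
  }

  def main(args: Array[String]): Unit =
    insertRecordIntoSorter(("a key", "a value"))
}
```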

insertRecordIntoSorter is used when UnsafeShuffleWriter is requested to write records.

== Closing and Writing Output (Merging Spill Files)

void closeAndWriteOutput()

                                                                                                                                                                                                                                                                                                                      closeAndWriteOutput asserts that the ShuffleExternalSorter is created (non-null).

                                                                                                                                                                                                                                                                                                                      closeAndWriteOutput updates peak memory used.

closeAndWriteOutput removes the references to the serialization buffer and the serialization stream (nulls them).

closeAndWriteOutput requests the ShuffleExternalSorter to close and return spill metadata.

closeAndWriteOutput removes the reference to the ShuffleExternalSorter (nulls it).

closeAndWriteOutput requests the IndexShuffleBlockResolver for the output data file for the shuffle and map IDs.

closeAndWriteOutput creates a temporary file (alongside the data output file) and uses it to merge spill files (which gives a partition length array). All spill files are then deleted.

closeAndWriteOutput requests the IndexShuffleBlockResolver to write shuffle index and data files (for the shuffle and map IDs, the partition lengths and the temporary file).

In the end, closeAndWriteOutput creates a MapStatus with the location of the local BlockManager and the partition lengths.

                                                                                                                                                                                                                                                                                                                      closeAndWriteOutput prints out the following ERROR message to the logs if there is an issue with deleting spill files:

Error while deleting spill file [path]

closeAndWriteOutput prints out the following ERROR message to the logs if there is an issue with deleting the temporary file:

Error while deleting temp file [path]

                                                                                                                                                                                                                                                                                                                      closeAndWriteOutput is used when UnsafeShuffleWriter is requested to write records.
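The overall closeAndWriteOutput sequence can be summarized with the stub-based sketch below; every helper is a println stand-in, not Spark's API, and the file names are made up for illustration.

```scala
// Hedged sketch of the closeAndWriteOutput ordering (illustrative stubs only).
object CloseAndWriteOutputSketch {
  def updatePeakMemoryUsed(): Unit = println("update peak memory used")
  def closeSorterAndGetSpills(): Seq[String] = { println("close sorter, collect spill metadata"); Seq("spill-0") }
  def getDataFile(): String = "shuffle_0_0_0.data" // made-up name
  def mergeSpills(spills: Seq[String], tmp: String): Array[Long] = {
    println(s"merge ${spills.size} spill file(s) into $tmp"); Array(42L)
  }
  def deleteSpillFiles(spills: Seq[String]): Unit = println("delete spill files")
  def writeIndexFileAndCommit(lengths: Array[Long], tmp: String): Unit = println("write index file and commit")
  def newMapStatus(lengths: Array[Long]): String = "MapStatus(local BlockManager, partition lengths)"

  def closeAndWriteOutput(): String = {
    updatePeakMemoryUsed()
    // the serialization buffer and stream are dropped (nulled) at this point
    val spills = closeSorterAndGetSpills()
    val tmp = getDataFile() + ".tmp" // temporary file alongside the data output file
    val partitionLengths = try mergeSpills(spills, tmp) finally deleteSpillFiles(spills)
    writeIndexFileAndCommit(partitionLengths, tmp)
    newMapStatus(partitionLengths)
  }

  def main(args: Array[String]): Unit = println(closeAndWriteOutput())
}
```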

== Getting Peak Memory Used

long getPeakMemoryUsedBytes()

getPeakMemoryUsedBytes simply updates the peak memory used (updatePeakMemoryUsed) and returns the internal peak memory registry.

getPeakMemoryUsedBytes is used when UnsafeShuffleWriter is requested to stop.

== Opening UnsafeShuffleWriter and Buffers

void open()

                                                                                                                                                                                                                                                                                                                      open requires that there is no <> available.

                                                                                                                                                                                                                                                                                                                      open creates a ShuffleExternalSorter.md[ShuffleExternalSorter].

                                                                                                                                                                                                                                                                                                                      open creates a <> with the capacity of <>.

                                                                                                                                                                                                                                                                                                                      open requests the <> for a serializer:SerializerInstance.md#serializeStream[SerializationStream] to the <> (available internally as the <> reference).

                                                                                                                                                                                                                                                                                                                      open is used when UnsafeShuffleWriter is <>.

                                                                                                                                                                                                                                                                                                                      == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.shuffle.sort.UnsafeShuffleWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"shuffle/UnsafeShuffleWriter/#log4jloggerorgapachesparkshufflesortunsafeshufflewriterall","title":"log4j.logger.org.apache.spark.shuffle.sort.UnsafeShuffleWriter=ALL

                                                                                                                                                                                                                                                                                                                      Refer to spark-logging.md[Logging].
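The same logger can also be configured at runtime. A minimal sketch, assuming the Log4j 1.x API (as implied by conf/log4j.properties) is on the classpath:

[source, scala]
----
import org.apache.log4j.{Level, Logger}

// Equivalent to the log4j.properties line above, but set programmatically
Logger.getLogger("org.apache.spark.shuffle.sort.UnsafeShuffleWriter").setLevel(Level.ALL)
----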

                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#internal-properties","title":"Internal Properties","text":""},{"location":"shuffle/UnsafeShuffleWriter/#mapstatus","title":"MapStatus

                                                                                                                                                                                                                                                                                                                      MapStatus

                                                                                                                                                                                                                                                                                                                      Created when UnsafeShuffleWriter is requested to <> (with the storage:BlockManagerId.md[] of the <> and partitionLengths)

Returned when UnsafeShuffleWriter is requested to <>

=== [[partitioner]] Partitioner

                                                                                                                                                                                                                                                                                                                      Partitioner (as used by the BaseShuffleHandle.md#dependency[ShuffleDependency] of the <>)

                                                                                                                                                                                                                                                                                                                      Used when UnsafeShuffleWriter is requested for the following:

                                                                                                                                                                                                                                                                                                                      • <> (and create a ShuffleExternalSorter.md[ShuffleExternalSorter] with the given ../rdd/Partitioner.md#numPartitions[number of partitions])

                                                                                                                                                                                                                                                                                                                      • <> (and request the ../rdd/Partitioner.md#getPartition[partition for the key])

• <>, <> and <> (for the ../rdd/Partitioner.md#numPartitions[number of partitions] to create partition lengths)

=== [[peakMemoryUsedBytes]] Peak Memory Used

                                                                                                                                                                                                                                                                                                                        Peak memory used (in bytes) that is updated exclusively in <> (after requesting the <> for ShuffleExternalSorter.md#getPeakMemoryUsedBytes[getPeakMemoryUsedBytes])

Use <> to access the current value

=== [[serBuffer]] ByteArrayOutputStream for Serialized Data

                                                                                                                                                                                                                                                                                                                        {java-javadoc-url}/java/io/ByteArrayOutputStream.html[java.io.ByteArrayOutputStream] of serialized data (written into a byte array of <> initial size)

                                                                                                                                                                                                                                                                                                                        Used when UnsafeShuffleWriter is requested for the following:

                                                                                                                                                                                                                                                                                                                        • <> (and create the internal <>)

                                                                                                                                                                                                                                                                                                                        • <>

                                                                                                                                                                                                                                                                                                                          Destroyed (null) when requested to <>.

                                                                                                                                                                                                                                                                                                                          === [[serializer]] serializer

                                                                                                                                                                                                                                                                                                                          serializer:SerializerInstance.md[SerializerInstance] (that is a new instance of the Serializer of the BaseShuffleHandle.md#dependency[ShuffleDependency] of the <>)

                                                                                                                                                                                                                                                                                                                          Used exclusively when UnsafeShuffleWriter is requested to <> (and creates the <>)

                                                                                                                                                                                                                                                                                                                          === [[serOutputStream]] serOutputStream

                                                                                                                                                                                                                                                                                                                          serializer:SerializationStream.md[SerializationStream] (that is created when the <> is requested to serializer:SerializerInstance.md#serializeStream[serializeStream] with the <>)

                                                                                                                                                                                                                                                                                                                          Used when UnsafeShuffleWriter is requested to <>

Destroyed (null) when requested to <>.

=== [[shuffleId]] Shuffle ID

                                                                                                                                                                                                                                                                                                                          Shuffle ID (of the ShuffleDependency of the SerializedShuffleHandle)

                                                                                                                                                                                                                                                                                                                          Used exclusively when requested to <>

                                                                                                                                                                                                                                                                                                                          === [[writeMetrics]] writeMetrics

                                                                                                                                                                                                                                                                                                                          executor:ShuffleWriteMetrics.md[] (of the TaskMetrics of the <>)

                                                                                                                                                                                                                                                                                                                          Used when UnsafeShuffleWriter is requested for the following:

                                                                                                                                                                                                                                                                                                                          • <> (and creates the <>)

                                                                                                                                                                                                                                                                                                                          • <>

                                                                                                                                                                                                                                                                                                                          • <>

• <>

# Stage-Level Scheduling

                                                                                                                                                                                                                                                                                                                            Stage-Level Scheduling uses ResourceProfiles for the following:

                                                                                                                                                                                                                                                                                                                            • Spark developers can specify task and executor resource requirements at stage level
                                                                                                                                                                                                                                                                                                                            • Spark (Scheduler) uses the stage-level requirements to acquire the necessary resources and executors and schedule tasks based on the per-stage requirements

                                                                                                                                                                                                                                                                                                                            Apache Spark 3.1.1

                                                                                                                                                                                                                                                                                                                            Stage-Level Scheduling was introduced in Apache Spark 3.1.1 (cf. SPARK-27495)

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#resource-profiles","title":"Resource Profiles","text":"

                                                                                                                                                                                                                                                                                                                            Resource Profiles are managed by ResourceProfileManager.

                                                                                                                                                                                                                                                                                                                            The Default ResourceProfile is known by ID 0.

                                                                                                                                                                                                                                                                                                                            Custom Resource Profiles are ResourceProfiles with non-0 IDs. Custom Resource Profiles are only supported on YARN, Kubernetes and Spark Standalone.

                                                                                                                                                                                                                                                                                                                            ResourceProfiles are associated with an RDD using withResources operator.
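For illustration, a hedged sketch of building a custom ResourceProfile and attaching it to an RDD; the concrete resource amounts are arbitrary assumptions, and a supported cluster manager is assumed:

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Executor- and task-level requirements for the stages computing this RDD
val executorReqs = new ExecutorResourceRequests().cores(4).memory("4g")
val taskReqs = new TaskResourceRequests().cpus(2)

val customProfile = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

// Associate the profile with the RDD lineage
val scaled = sc.range(0, 100).map(_ * 2).withResources(customProfile)
```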

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#resource-requests","title":"Resource Requests","text":""},{"location":"stage-level-scheduling/#executor","title":"Executor","text":"

                                                                                                                                                                                                                                                                                                                            Executor Resource Requests are specified using executorResources of a ResourceProfile.

                                                                                                                                                                                                                                                                                                                            Executor Resource Requests can be the following built-in resources:

                                                                                                                                                                                                                                                                                                                            • cores
                                                                                                                                                                                                                                                                                                                            • memory
                                                                                                                                                                                                                                                                                                                            • memoryOverhead
                                                                                                                                                                                                                                                                                                                            • pyspark.memory
                                                                                                                                                                                                                                                                                                                            • offHeap

                                                                                                                                                                                                                                                                                                                            Other (deployment environment-specific) executor resource requests can be defined as Custom Executor Resources.

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#task","title":"Task","text":"

                                                                                                                                                                                                                                                                                                                            Default Task Resources are specified based on spark.task.cpus and spark.task.resource-prefixed configuration properties.
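Beyond those defaults, per-task requirements can be expressed with TaskResourceRequests. A minimal sketch; the gpu resource name and the amounts are assumptions for illustration:

```scala
import org.apache.spark.resource.TaskResourceRequests

// 2 CPU cores and half of a "gpu" address per task (fractional amounts let
// multiple tasks share a single resource address)
val taskReqs = new TaskResourceRequests()
  .cpus(2)
  .resource("gpu", 0.5)
```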

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#sparklistenerresourceprofileadded","title":"SparkListenerResourceProfileAdded","text":"

                                                                                                                                                                                                                                                                                                                            ResourceProfiles can be monitored using SparkListenerResourceProfileAdded.
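A minimal sketch of such monitoring with a custom SparkListener (the listener class name is made up for illustration):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerResourceProfileAdded}

// Prints every ResourceProfile registered with the ResourceProfileManager
class ResourceProfileLogger extends SparkListener {
  override def onResourceProfileAdded(event: SparkListenerResourceProfileAdded): Unit = {
    val rp = event.resourceProfile
    println(s"ResourceProfile added: id=${rp.id} executors=${rp.executorResources} tasks=${rp.taskResources}")
  }
}

sc.addSparkListener(new ResourceProfileLogger)
```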

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#dynamic-allocation","title":"Dynamic Allocation","text":"

                                                                                                                                                                                                                                                                                                                            Dynamic Allocation of Executors is not supported.

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#demo","title":"Demo","text":""},{"location":"stage-level-scheduling/#describe-distributed-computation","title":"Describe Distributed Computation","text":"

Let's describe a distributed computation (using the RDD API) over a 10-record dataset.

```scala
val rdd = sc.range(0, 10)
```
                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#describe-required-resources","title":"Describe Required Resources","text":"

                                                                                                                                                                                                                                                                                                                            Optional Step

This demo is assumed to run in local deployment mode (which supports the default ResourceProfile only), so this step is optional until a supported cluster manager is used.

```scala
import org.apache.spark.resource.ResourceProfileBuilder
val rpb = new ResourceProfileBuilder
val rp1 = rpb.build()
```

```text
scala> println(rp1.toString)
Profile: id = 1, executor resources: , task resources:
```

### Configure Default ResourceProfile

                                                                                                                                                                                                                                                                                                                            FIXME

                                                                                                                                                                                                                                                                                                                            Use spark.task.resource-prefixed properties per ResourceUtils.
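As a hedged sketch only (the gpu resource name, the amounts, and the discovery script path are assumptions), the default profile's resources could be configured with spark.executor.resource.*- and spark.task.resource.*-prefixed properties:

```scala
import org.apache.spark.SparkConf

// These properties feed the default ResourceProfile (ID 0)
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.discoveryScript", "/path/to/getGpus.sh")
  .set("spark.task.resource.gpu.amount", "1")
```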

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/#associate-required-resources-to-distributed-computation","title":"Associate Required Resources to Distributed Computation","text":"
                                                                                                                                                                                                                                                                                                                            rdd.withResources(rp1)\n
                                                                                                                                                                                                                                                                                                                            scala> rdd.withResources(rp1)\norg.apache.spark.SparkException: TaskResourceProfiles are only supported for Standalone cluster for now when dynamic allocation is disabled.\n  at org.apache.spark.resource.ResourceProfileManager.isSupported(ResourceProfileManager.scala:71)\n  at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:126)\n  at org.apache.spark.rdd.RDD.withResources(RDD.scala:1802)\n  ... 42 elided\n
                                                                                                                                                                                                                                                                                                                            SPARK-43912

                                                                                                                                                                                                                                                                                                                            Reported as SPARK-43912 Incorrect SparkException for Stage-Level Scheduling in local mode.

                                                                                                                                                                                                                                                                                                                            Until it is fixed, enable Dynamic Allocation.

```text
$ ./bin/spark-shell -c spark.dynamicAllocation.enabled=true
```

# ExecutorResourceInfo

                                                                                                                                                                                                                                                                                                                            ExecutorResourceInfo is a ResourceAllocator.

                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ExecutorResourceInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                            ExecutorResourceInfo takes the following to be created:

                                                                                                                                                                                                                                                                                                                            • Resource Name
                                                                                                                                                                                                                                                                                                                            • Addresses
                                                                                                                                                                                                                                                                                                                            • Number of slots (per address)

                                                                                                                                                                                                                                                                                                                              ExecutorResourceInfo is created when:

                                                                                                                                                                                                                                                                                                                              • DriverEndpoint is requested to handle a RegisterExecutor event
                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ExecutorResourceRequest/","title":"ExecutorResourceRequest","text":""},{"location":"stage-level-scheduling/ExecutorResourceRequest/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                              ExecutorResourceRequest takes the following to be created:

                                                                                                                                                                                                                                                                                                                              • Resource Name
                                                                                                                                                                                                                                                                                                                              • Amount
                                                                                                                                                                                                                                                                                                                              • Discovery Script
                                                                                                                                                                                                                                                                                                                              • Vendor

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequest is created when:

                                                                                                                                                                                                                                                                                                                                • ExecutorResourceRequests is requested to memory, offHeapMemory, memoryOverhead, pysparkMemory, cores and resource
                                                                                                                                                                                                                                                                                                                                • JsonProtocol utility is used to executorResourceRequestFromJson
                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ExecutorResourceRequest/#serializable","title":"Serializable","text":"

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequest is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ExecutorResourceRequests/","title":"ExecutorResourceRequests","text":"

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequests is a set of ExecutorResourceRequests for Spark developers to (programmatically) specify resources for an RDD to be applied at stage level:

                                                                                                                                                                                                                                                                                                                                • cores
                                                                                                                                                                                                                                                                                                                                • memory
                                                                                                                                                                                                                                                                                                                                • memoryOverhead
                                                                                                                                                                                                                                                                                                                                • offHeap
                                                                                                                                                                                                                                                                                                                                • pyspark.memory
                                                                                                                                                                                                                                                                                                                                • custom resource
                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequests takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequests is created when:

                                                                                                                                                                                                                                                                                                                                • ResourceProfile utility is used to get the default executor resource requests (for tasks)
                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#serializable","title":"Serializable","text":"

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequests is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#resource","title":"resource
                                                                                                                                                                                                                                                                                                                                resource(\n  resourceName: String,\n  amount: Long,\n  discoveryScript: String = \"\",\n  vendor: String = \"\"): this.type\n

resource creates an ExecutorResourceRequest and registers it under resourceName.

                                                                                                                                                                                                                                                                                                                                resource is used when:

                                                                                                                                                                                                                                                                                                                                • ResourceProfile utility is used for the default executor resources
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ExecutorResourceRequests/#text-representation","title":"Text Representation

                                                                                                                                                                                                                                                                                                                                ExecutorResourceRequests presents itself as:

```text
Executor resource requests: [_executorResources]
```

## Demo

```scala
import org.apache.spark.resource.ExecutorResourceRequests
val executorResources = new ExecutorResourceRequests()
  .memory("2g")
  .memoryOverhead("512m")
  .cores(8)
  .resource(
    resourceName = "my-custom-resource",
    amount = 1,
    discoveryScript = "/this/is/path/to/discovery/script.sh",
    vendor = "pl.japila")
```

```text
scala> println(executorResources)
Executor resource requests: {memoryOverhead=name: memoryOverhead, amount: 512, script: , vendor: , memory=name: memory, amount: 2048, script: , vendor: , cores=name: cores, amount: 8, script: , vendor: , my-custom-resource=name: my-custom-resource, amount: 1, script: /this/is/path/to/discovery/script.sh, vendor: pl.japila}
```
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/","title":"ResourceAllocator","text":"

                                                                                                                                                                                                                                                                                                                                ResourceAllocator is an abstraction of resource allocators.

                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceAllocator/#contract","title":"Contract","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#resourceaddresses","title":"resourceAddresses
                                                                                                                                                                                                                                                                                                                                resourceAddresses: Seq[String]\n

                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                • ResourceAllocator is requested for the addressAvailabilityMap
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#resourcename","title":"resourceName
                                                                                                                                                                                                                                                                                                                                resourceName: String\n

                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                • ResourceAllocator is requested to acquire and release addresses
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#slotsperaddress","title":"slotsPerAddress
                                                                                                                                                                                                                                                                                                                                slotsPerAddress: Int\n

                                                                                                                                                                                                                                                                                                                                Used when:

• ResourceAllocator is requested for the addressAvailabilityMap and assignedAddrs, and to release addresses
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                • ExecutorResourceInfo
                                                                                                                                                                                                                                                                                                                                • WorkerResourceInfo (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceAllocator/#acquiring-addresses","title":"Acquiring Addresses
                                                                                                                                                                                                                                                                                                                                acquire(\n  addrs: Seq[String]): Unit\n

acquire marks the given addresses (addrs) as taken by decreasing their number of available slots in the addressAvailabilityMap. acquire throws a SparkException for an address that is unknown or has no slots left.

                                                                                                                                                                                                                                                                                                                                acquire is used when:

                                                                                                                                                                                                                                                                                                                                • DriverEndpoint is requested to launchTasks
                                                                                                                                                                                                                                                                                                                                • WorkerResourceInfo (Spark Standalone) is requested to acquire and recoverResources
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#releasing-addresses","title":"Releasing Addresses
                                                                                                                                                                                                                                                                                                                                release(\n  addrs: Seq[String]): Unit\n

release gives the given addresses (addrs) back by increasing their number of available slots in the addressAvailabilityMap. release throws a SparkException for an address that is unknown or was not acquired.

                                                                                                                                                                                                                                                                                                                                release is used when:

                                                                                                                                                                                                                                                                                                                                • DriverEndpoint is requested to handle a StatusUpdate event
                                                                                                                                                                                                                                                                                                                                • WorkerInfo (Spark Standalone) is requested to releaseResources
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#assignedaddrs","title":"assignedAddrs
                                                                                                                                                                                                                                                                                                                                assignedAddrs: Seq[String]\n

assignedAddrs are the resource addresses with at least one slot already assigned (taken).

                                                                                                                                                                                                                                                                                                                                assignedAddrs is used when:

                                                                                                                                                                                                                                                                                                                                • WorkerInfo (Spark Standalone) is requested for the resourcesInfoUsed
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#availableaddrs","title":"availableAddrs
                                                                                                                                                                                                                                                                                                                                availableAddrs: Seq[String]\n

availableAddrs are the resource addresses with at least one slot still available.

                                                                                                                                                                                                                                                                                                                                availableAddrs is used when:

                                                                                                                                                                                                                                                                                                                                • WorkerInfo (Spark Standalone) is requested for the resourcesInfoFree
                                                                                                                                                                                                                                                                                                                                • WorkerResourceInfo (Spark Standalone) is requested to acquire and resourcesAmountFree
                                                                                                                                                                                                                                                                                                                                • DriverEndpoint is requested to makeOffers
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#addressavailabilitymap","title":"addressAvailabilityMap
addressAvailabilityMap: mutable.HashMap[String, Int]\n

addressAvailabilityMap is a registry of the number of slots still available per resource address. Every resourceAddress starts with slotsPerAddress slots available.

                                                                                                                                                                                                                                                                                                                                Lazy Value

                                                                                                                                                                                                                                                                                                                                addressAvailabilityMap is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

                                                                                                                                                                                                                                                                                                                                Learn more in the Scala Language Specification.

                                                                                                                                                                                                                                                                                                                                addressAvailabilityMap is used when:

                                                                                                                                                                                                                                                                                                                                • ResourceAllocator is requested to availableAddrs, assignedAddrs, acquire, release
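
Putting the pieces above together, here is a minimal sketch of such a slot registry. It is not the Spark source, only an illustration of how resourceAddresses, slotsPerAddress, acquire, release, availableAddrs, assignedAddrs and addressAvailabilityMap fit together; the error messages are made up.

```scala
import scala.collection.mutable
import org.apache.spark.SparkException

// A sketch of a ResourceAllocator-like slot registry (illustration only)
trait SlotRegistrySketch {
  protected def resourceName: String
  protected def resourceAddresses: Seq[String]
  protected def slotsPerAddress: Int

  // address -> number of slots still available (initialized once, then mutated by acquire and release)
  private lazy val addressAvailabilityMap: mutable.HashMap[String, Int] =
    mutable.HashMap(resourceAddresses.map(_ -> slotsPerAddress): _*)

  // Addresses with at least one slot still available
  def availableAddrs: Seq[String] =
    addressAvailabilityMap.filter(_._2 > 0).keys.toSeq.sorted

  // Addresses with at least one slot already assigned
  def assignedAddrs: Seq[String] =
    addressAvailabilityMap.filter(_._2 < slotsPerAddress).keys.toSeq.sorted

  def acquire(addrs: Seq[String]): Unit = addrs.foreach { addr =>
    val slotsLeft = addressAvailabilityMap.getOrElse(addr,
      throw new SparkException(s"Unknown $resourceName address: $addr"))
    if (slotsLeft <= 0) {
      throw new SparkException(s"$resourceName address $addr has no slots left")
    }
    addressAvailabilityMap(addr) = slotsLeft - 1
  }

  def release(addrs: Seq[String]): Unit = addrs.foreach { addr =>
    val slotsLeft = addressAvailabilityMap.getOrElse(addr,
      throw new SparkException(s"Unknown $resourceName address: $addr"))
    if (slotsLeft >= slotsPerAddress) {
      throw new SparkException(s"$resourceName address $addr was not acquired")
    }
    addressAvailabilityMap(addr) = slotsLeft + 1
  }
}
```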
                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"stage-level-scheduling/ResourceID/","title":"ResourceID","text":"

                                                                                                                                                                                                                                                                                                                                ResourceID is...FIXME

                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/ResourceProfile/","title":"ResourceProfile","text":"

ResourceProfile describes the executor and task resource requirements of an RDD in Stage-Level Scheduling.

A ResourceProfile can be associated with an RDD using the RDD.withResources method.

The ResourceProfile of an RDD is available using the RDD.getResourceProfile method.
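
For example (a sketch of the public API, assuming a live SparkContext sc and a cluster setup that supports custom resource profiles; the resource names and amounts are made up):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Ask for 8-core, 4g executors with 2 GPUs, and 2 CPUs plus 1 GPU per task (made-up amounts)
val execReqs = new ExecutorResourceRequests().cores(8).memory("4g").resource("gpu", 2)
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1)
val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build

// Associate the profile with an RDD and read it back
val rdd = sc.range(0, 100).map(_ * 2).withResources(profile)
assert(rdd.getResourceProfile == profile)
```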

                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfile/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                ResourceProfile takes the following to be created:

                                                                                                                                                                                                                                                                                                                                • Executor Resources (Map[String, ExecutorResourceRequest])
                                                                                                                                                                                                                                                                                                                                • Task Resources (Map[String, TaskResourceRequest])

ResourceProfile is created (directly or using getOrCreateDefaultProfile) when:

                                                                                                                                                                                                                                                                                                                                  • DriverEndpoint is requested to handle a RetrieveSparkAppConfig message
                                                                                                                                                                                                                                                                                                                                  • ResourceProfileBuilder utility is requested to build
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#allSupportedExecutorResources","title":"Built-In Executor Resources","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfile defines the following names as the Supported Executor Resources (among the specified executorResources):

                                                                                                                                                                                                                                                                                                                                  • cores
                                                                                                                                                                                                                                                                                                                                  • memory
                                                                                                                                                                                                                                                                                                                                  • memoryOverhead
                                                                                                                                                                                                                                                                                                                                  • pyspark.memory
                                                                                                                                                                                                                                                                                                                                  • offHeap

                                                                                                                                                                                                                                                                                                                                  All other executor resources (names) are considered Custom Executor Resources.

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#getCustomExecutorResources","title":"Custom Executor Resources","text":"
                                                                                                                                                                                                                                                                                                                                  getCustomExecutorResources(): Map[String, ExecutorResourceRequest]\n

getCustomExecutorResources returns the Executor Resources that are not among the built-in (supported) executor resources.

                                                                                                                                                                                                                                                                                                                                  getCustomExecutorResources is used when:

                                                                                                                                                                                                                                                                                                                                  • ApplicationDescription is requested to resourceReqsPerExecutor
                                                                                                                                                                                                                                                                                                                                  • ApplicationInfo is requested to createResourceDescForResourceProfile
                                                                                                                                                                                                                                                                                                                                  • ResourceProfile is requested to calculateTasksAndLimitingResource
                                                                                                                                                                                                                                                                                                                                  • ResourceUtils is requested to getOrDiscoverAllResourcesForResourceProfile, warnOnWastedResources
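
In other words, getCustomExecutorResources boils down to filtering out the built-in resource names listed above (a sketch, not the exact source):

```scala
import org.apache.spark.resource.ExecutorResourceRequest

// Sketch: everything that is not a built-in executor resource name is custom
val builtInExecutorResources = Set("cores", "memory", "memoryOverhead", "pyspark.memory", "offHeap")

def customExecutorResources(
    executorResources: Map[String, ExecutorResourceRequest]): Map[String, ExecutorResourceRequest] =
  executorResources.filter { case (name, _) => !builtInExecutorResources.contains(name) }
```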
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#limitingResource","title":"Limiting Resource","text":"
                                                                                                                                                                                                                                                                                                                                  limitingResource(\n  sparkConf: SparkConf): String\n

limitingResource returns the _limitingResource, if already calculated, or calculates it first using calculateTasksAndLimitingResource.

                                                                                                                                                                                                                                                                                                                                  limitingResource is used when:

                                                                                                                                                                                                                                                                                                                                  • ResourceProfileManager is requested to add a new ResourceProfile (to recompute a limiting resource eagerly)
                                                                                                                                                                                                                                                                                                                                  • ResourceUtils is requested to warnOnWastedResources (for reporting purposes only)
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#_limitingResource","title":"_limitingResource","text":"
                                                                                                                                                                                                                                                                                                                                  _limitingResource: Option[String] = None\n

ResourceProfile defines the _limitingResource internal variable that is determined (if there is one) in calculateTasksAndLimitingResource.

                                                                                                                                                                                                                                                                                                                                  _limitingResource can be the following:

                                                                                                                                                                                                                                                                                                                                  • A \"special\" empty resource identifier (that is assumed cpus in TaskSchedulerImpl)
                                                                                                                                                                                                                                                                                                                                  • cpus built-in task resource identifier
                                                                                                                                                                                                                                                                                                                                  • any custom resource identifier
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#defaultProfile","title":"Default Profile","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfile (Scala object) defines defaultProfile internal registry for the default ResourceProfile (per JVM instance).

defaultProfile is initially undefined (None) and is assigned a new ResourceProfile when first requested.

                                                                                                                                                                                                                                                                                                                                  defaultProfile can be accessed using getOrCreateDefaultProfile.

                                                                                                                                                                                                                                                                                                                                  defaultProfile is cleared (removed) in clearDefaultProfile.

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#getOrCreateDefaultProfile","title":"getOrCreateDefaultProfile","text":"
                                                                                                                                                                                                                                                                                                                                  getOrCreateDefaultProfile(\n  conf: SparkConf): ResourceProfile\n

getOrCreateDefaultProfile returns the default profile, if already defined.

Otherwise, getOrCreateDefaultProfile creates a ResourceProfile with the default task and executor resource descriptions and registers it as the defaultProfile.

                                                                                                                                                                                                                                                                                                                                  getOrCreateDefaultProfile prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                  Default ResourceProfile created,\nexecutor resources: [executorResources], task resources: [taskResources]\n

getOrCreateDefaultProfile is used when:

                                                                                                                                                                                                                                                                                                                                  • TaskResourceProfile is requested to getCustomExecutorResources
                                                                                                                                                                                                                                                                                                                                  • ResourceProfile is requested to getDefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                  • ResourceProfileManager is created
                                                                                                                                                                                                                                                                                                                                  • YarnAllocator (Spark on YARN) is requested to initDefaultProfile
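
The get-or-create logic boils down to a lazily-initialized, per-JVM registry. A simplified sketch follows (createDefaultProfile is a hypothetical stand-in for building a profile from the default executor and task resource requests described in the next two sections):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.resource.ResourceProfile

// Simplified sketch of the per-JVM default-profile registry (not the exact Spark source)
object DefaultProfileRegistrySketch {
  private var defaultProfile: Option[ResourceProfile] = None

  // Hypothetical stand-in for creating a ResourceProfile from the default
  // executor and task resource requests (see the two sections that follow)
  private def createDefaultProfile(conf: SparkConf): ResourceProfile = ???

  def getOrCreateDefaultProfile(conf: SparkConf): ResourceProfile = synchronized {
    defaultProfile.getOrElse {
      val prof = createDefaultProfile(conf)
      defaultProfile = Some(prof)
      println(s"Default ResourceProfile created, " +
        s"executor resources: ${prof.executorResources}, task resources: ${prof.taskResources}")
      prof
    }
  }

  // Counterpart of clearDefaultProfile
  def clearDefaultProfile(): Unit = synchronized { defaultProfile = None }
}
```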
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultExecutorResources","title":"Default Executor Resource Requests","text":"
                                                                                                                                                                                                                                                                                                                                  getDefaultExecutorResources(\n  conf: SparkConf): Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                  getDefaultExecutorResources creates an ExecutorResourceRequests with the following:

| Property | Configuration Property |
|----------|------------------------|
| cores | spark.executor.cores |
| memory | spark.executor.memory |
| memoryOverhead | spark.executor.memoryOverhead |
| pysparkMemory | spark.executor.pyspark.memory |
| offHeapMemory | spark.memory.offHeap.size |

getDefaultExecutorResources then adds the executor resource requests configured in the given SparkConf (with the spark.executor component name) to the ExecutorResourceRequests.

                                                                                                                                                                                                                                                                                                                                  getDefaultExecutorResources initializes the defaultProfileExecutorResources (with the executor resource requests).

In the end, getDefaultExecutorResources requests the ExecutorResourceRequests for all the resource requests.
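
A rough sketch of the above with the public ExecutorResourceRequests API (the configuration keys match the table above; the fallback values are illustrative and custom spark.executor.resource.* requests are omitted):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.resource.{ExecutorResourceRequest, ExecutorResourceRequests}

def defaultExecutorResourcesSketch(conf: SparkConf): Map[String, ExecutorResourceRequest] = {
  val ereqs = new ExecutorResourceRequests()
    .cores(conf.getInt("spark.executor.cores", 1))
    .memory(conf.get("spark.executor.memory", "1g"))
  // Optional settings are only added when configured
  conf.getOption("spark.executor.memoryOverhead").foreach(v => ereqs.memoryOverhead(v))
  conf.getOption("spark.executor.pyspark.memory").foreach(v => ereqs.pysparkMemory(v))
  conf.getOption("spark.memory.offHeap.size").foreach(v => ereqs.offHeapMemory(v))
  ereqs.requests
}
```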

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultTaskResources","title":"Default Task Resource Requests","text":"
                                                                                                                                                                                                                                                                                                                                  getDefaultTaskResources(\n  conf: SparkConf): Map[String, TaskResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                  getDefaultTaskResources creates a new TaskResourceRequests with the cpus based on spark.task.cpus configuration property.

                                                                                                                                                                                                                                                                                                                                  getDefaultTaskResources adds task resource requests (configured in the given SparkConf using spark.task.resource-prefixed properties).

In the end, getDefaultTaskResources requests the TaskResourceRequests for all the resource requests.
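
Similarly, a sketch of the default task resource requests with the public TaskResourceRequests API (custom spark.task.resource.* requests omitted):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.resource.{TaskResourceRequest, TaskResourceRequests}

def defaultTaskResourcesSketch(conf: SparkConf): Map[String, TaskResourceRequest] =
  new TaskResourceRequests()
    .cpus(conf.getInt("spark.task.cpus", 1))
    .requests
```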

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfile/#getresourcesforclustermanager","title":"getResourcesForClusterManager
                                                                                                                                                                                                                                                                                                                                  getResourcesForClusterManager(\n  rpId: Int,\n  execResources: Map[String, ExecutorResourceRequest],\n  overheadFactor: Double,\n  conf: SparkConf,\n  isPythonApp: Boolean,\n  resourceMappings: Map[String, String]): ExecutorResourcesOrDefaults\n

                                                                                                                                                                                                                                                                                                                                  getResourcesForClusterManager takes the DefaultProfileExecutorResources.

                                                                                                                                                                                                                                                                                                                                  getResourcesForClusterManager calculates the overhead memory with the following:

                                                                                                                                                                                                                                                                                                                                  • memoryOverheadMiB and executorMemoryMiB of the DefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                  • Given overheadFactor
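
The overhead-memory part can be sketched as follows; the 384 MiB floor and the explicit-overhead-wins rule are assumptions based on how Spark generally computes executor memory overhead, not a quote of this method:

```scala
// Sketch: an explicit memoryOverhead (if any) wins; otherwise overheadFactor of the executor memory,
// with a minimum of 384 MiB (assumption based on Spark's usual executor-overhead rule)
def overheadMemoryMiBSketch(
    memoryOverheadMiB: Option[Long],
    executorMemoryMiB: Long,
    overheadFactor: Double): Long =
  memoryOverheadMiB.getOrElse(
    math.max((executorMemoryMiB * overheadFactor).toLong, 384L))
```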

                                                                                                                                                                                                                                                                                                                                  If the given rpId resource profile ID is not the default ID (0), getResourcesForClusterManager...FIXME (there is so much to \"digest\")

                                                                                                                                                                                                                                                                                                                                  getResourcesForClusterManager...FIXME

In the end, getResourcesForClusterManager creates an ExecutorResourcesOrDefaults.

                                                                                                                                                                                                                                                                                                                                  getResourcesForClusterManager is used when:

                                                                                                                                                                                                                                                                                                                                  • BasicExecutorFeatureStep (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                  • YarnAllocator (Spark on YARN) is requested to createYarnResourceForResourceProfile
                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultProfileExecutorResources","title":"getDefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                  getDefaultProfileExecutorResources(\n  conf: SparkConf): DefaultProfileExecutorResources\n

                                                                                                                                                                                                                                                                                                                                  getDefaultProfileExecutorResources...FIXME

                                                                                                                                                                                                                                                                                                                                  getDefaultProfileExecutorResources is used when:

                                                                                                                                                                                                                                                                                                                                  • ResourceProfile is requested to getResourcesForClusterManager
                                                                                                                                                                                                                                                                                                                                  • YarnAllocator (Spark on YARN) is requested to runAllocatedContainers
                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                  ResourceProfile is a Java Serializable.

                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.resource.ResourceProfile logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                  logger.ResourceProfile.name = org.apache.spark.resource.ResourceProfile\nlogger.ResourceProfile.level = all\n

                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"stage-level-scheduling/ResourceProfileBuilder/","title":"ResourceProfileBuilder","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfileBuilder is a fluent API for Spark developers to build ResourceProfiles (to associate with an RDD).

                                                                                                                                                                                                                                                                                                                                  Available in Scala and Python APIs

                                                                                                                                                                                                                                                                                                                                  ResourceProfileBuilder is available in Scala and Python APIs.

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfileBuilder takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#build","title":"Building ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                  build: ResourceProfile\n

                                                                                                                                                                                                                                                                                                                                  build creates a ResourceProfile:

• A TaskResourceProfile when no executor resources have been specified (_executorResources is empty)
• A ResourceProfile with the executorResources and the taskResources otherwise
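
For example (a sketch; which concrete class comes back is an internal detail of build):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Only task requirements => build is expected to produce a TaskResourceProfile
val taskOnly = new ResourceProfileBuilder()
  .require(new TaskResourceRequests().cpus(4))
  .build

// Executor requirements present => a regular ResourceProfile with both maps
val full = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(8).memory("4g"))
  .require(new TaskResourceRequests().cpus(2))
  .build
```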
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#executorResources","title":"Executor Resources","text":"
                                                                                                                                                                                                                                                                                                                                  executorResources: Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                  executorResources...FIXME

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#taskResources","title":"Task Resources
                                                                                                                                                                                                                                                                                                                                  taskResources: Map[String, TaskResourceRequest]\n

taskResources is the TaskResourceRequests specified by users (keyed by resource name).

taskResources are specified using the require method.

taskResources can be removed using the clearTaskResourceRequests method.

taskResources can be printed out using the toString method.

                                                                                                                                                                                                                                                                                                                                  taskResources is used when:

                                                                                                                                                                                                                                                                                                                                  • ResourceProfileBuilder is requested to build a ResourceProfile
                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"stage-level-scheduling/ResourceProfileBuilder/#demo","title":"Demo","text":"
import org.apache.spark.resource.ResourceProfileBuilder\nval rp1 = new ResourceProfileBuilder()\n\nimport org.apache.spark.resource.ExecutorResourceRequests\nval execReqs = new ExecutorResourceRequests().cores(4).resource(\"gpu\", 4)\n\nimport org.apache.spark.resource.TaskResourceRequests\nval taskReqs = new TaskResourceRequests().cpus(1).resource(\"gpu\", 1)\n\nrp1.require(execReqs).require(taskReqs)\nval rprof1 = rp1.build\n
                                                                                                                                                                                                                                                                                                                                  val rpManager = sc.resourceProfileManager // (1)!\nrpManager.addResourceProfile(rprof1)\n
                                                                                                                                                                                                                                                                                                                                  1. NOTE: resourceProfileManager is private[spark]
                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileManager/","title":"ResourceProfileManager","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfileManager manages ResourceProfiles.

                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/ResourceProfileManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                  ResourceProfileManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                  • LiveListenerBus

                                                                                                                                                                                                                                                                                                                                    ResourceProfileManager is created when:

                                                                                                                                                                                                                                                                                                                                    • SparkContext is created
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#accessing-resourceprofilemanager","title":"Accessing ResourceProfileManager","text":"

                                                                                                                                                                                                                                                                                                                                    ResourceProfileManager is available to other Spark services using SparkContext.

                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#resourceProfileIdToResourceProfile","title":"Registered ResourceProfiles","text":"
                                                                                                                                                                                                                                                                                                                                    resourceProfileIdToResourceProfile: HashMap[Int, ResourceProfile]\n

                                                                                                                                                                                                                                                                                                                                    ResourceProfileManager creates resourceProfileIdToResourceProfile registry of ResourceProfiles by their ID.

A new ResourceProfile is added using addResourceProfile.

                                                                                                                                                                                                                                                                                                                                    ResourceProfiles are resolved (looked up) using resourceProfileFromId.

                                                                                                                                                                                                                                                                                                                                    ResourceProfiles can be equivalent when they specify the same resources.

                                                                                                                                                                                                                                                                                                                                    resourceProfileIdToResourceProfile is used when:

                                                                                                                                                                                                                                                                                                                                    • canBeScheduled
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#defaultProfile","title":"Default ResourceProfile","text":"

                                                                                                                                                                                                                                                                                                                                    ResourceProfileManager gets or creates the default ResourceProfile when created and registers it immediately.

                                                                                                                                                                                                                                                                                                                                    The default profile is available as defaultResourceProfile.

                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#defaultResourceProfile","title":"Accessing Default ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                    defaultResourceProfile: ResourceProfile\n

                                                                                                                                                                                                                                                                                                                                    defaultResourceProfile returns the default ResourceProfile.

                                                                                                                                                                                                                                                                                                                                    defaultResourceProfile is used when:

                                                                                                                                                                                                                                                                                                                                    • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                    • SparkContext is requested to requestTotalExecutors and createTaskScheduler
                                                                                                                                                                                                                                                                                                                                    • DAGScheduler is requested to mergeResourceProfilesForStage
                                                                                                                                                                                                                                                                                                                                    • CoarseGrainedSchedulerBackend is requested to requestExecutors
                                                                                                                                                                                                                                                                                                                                    • StandaloneSchedulerBackend (Spark Standalone) is created
                                                                                                                                                                                                                                                                                                                                    • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                    • MesosCoarseGrainedSchedulerBackend (Spark on Mesos) is created
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#addResourceProfile","title":"Registering ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                    addResourceProfile(\n  rp: ResourceProfile): Unit\n

                                                                                                                                                                                                                                                                                                                                    addResourceProfile checks if the given ResourceProfile is supported.

                                                                                                                                                                                                                                                                                                                                    addResourceProfile registers the given ResourceProfile (in the resourceProfileIdToResourceProfile registry) unless done earlier (by ResourceProfile ID).

                                                                                                                                                                                                                                                                                                                                    With a new ResourceProfile, addResourceProfile requests the given ResourceProfile for the limiting resource (for no reason but to calculate it upfront) and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                    Added ResourceProfile id: [id]\n

                                                                                                                                                                                                                                                                                                                                    In the end (for a new ResourceProfile), addResourceProfile requests the LiveListenerBus to post a SparkListenerResourceProfileAdded.

                                                                                                                                                                                                                                                                                                                                    addResourceProfile is used when:

• RDD.withResources operator is used (see the sketch after this list)
                                                                                                                                                                                                                                                                                                                                    • ResourceProfileManager is created (and registers the default profile)
                                                                                                                                                                                                                                                                                                                                    • DAGScheduler is requested to mergeResourceProfilesForStage
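As a hedged illustration (assuming a Spark shell with sc available and a setup where custom ResourceProfiles are supported, e.g. YARN or Kubernetes with Dynamic Allocation enabled), attaching a ResourceProfile to an RDD is one way addResourceProfile gets triggered:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}\n\n// Build a custom profile (the resource amounts are arbitrary for this example)\nval rp = new ResourceProfileBuilder().require(new ExecutorResourceRequests().cores(4)).require(new TaskResourceRequests().cpus(1)).build\n\n// RDD.withResources registers the profile with the ResourceProfileManager,\n// so \"Added ResourceProfile id: [id]\" should show up in the logs\nval rdd = sc.parallelize(1 to 10).withResources(rp)\nprintln(rdd.getResourceProfile)\n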
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#dynamicEnabled","title":"Dynamic Allocation","text":"

ResourceProfileManager initializes the dynamicEnabled flag using isDynamicAllocationEnabled when created.

                                                                                                                                                                                                                                                                                                                                    dynamicEnabled flag is used when:

                                                                                                                                                                                                                                                                                                                                    • isSupported
                                                                                                                                                                                                                                                                                                                                    • canBeScheduled
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#isSupported","title":"isSupported","text":"
                                                                                                                                                                                                                                                                                                                                    isSupported(\n  rp: ResourceProfile): Boolean\n

                                                                                                                                                                                                                                                                                                                                    isSupported...FIXME

                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#canBeScheduled","title":"canBeScheduled","text":"
                                                                                                                                                                                                                                                                                                                                    canBeScheduled(\n  taskRpId: Int,\n  executorRpId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                    canBeScheduled asserts that the given taskRpId and executorRpId are valid ResourceProfile IDs or throws an AssertionError:

                                                                                                                                                                                                                                                                                                                                    Tasks and executors must have valid resource profile id\n

canBeScheduled looks up the ResourceProfile for the given taskRpId (in the resourceProfileIdToResourceProfile registry).

canBeScheduled holds positive (true) when either of the following holds (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                    1. The given taskRpId and executorRpId are the same
                                                                                                                                                                                                                                                                                                                                    2. Dynamic Allocation is disabled and the ResourceProfile is a TaskResourceProfile
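A self-contained, illustrative sketch of that decision (not Spark's actual code; the two boolean parameters stand in for state that ResourceProfileManager keeps internally):

// Illustrative only: dynamicEnabled mirrors the Dynamic Allocation flag and\n// taskRpIsTaskResourceProfile mirrors \"the looked-up profile is a TaskResourceProfile\"\ndef canBeScheduled(taskRpId: Int, executorRpId: Int, dynamicEnabled: Boolean, taskRpIsTaskResourceProfile: Boolean): Boolean =\n  taskRpId == executorRpId || (!dynamicEnabled && taskRpIsTaskResourceProfile)\n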

                                                                                                                                                                                                                                                                                                                                    canBeScheduled is used when:

                                                                                                                                                                                                                                                                                                                                    • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet and calculateAvailableSlots
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceProfileManager/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.resource.ResourceProfileManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                    logger.ResourceProfileManager.name = org.apache.spark.resource.ResourceProfileManager\nlogger.ResourceProfileManager.level = all\n

                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/","title":"ResourceUtils","text":""},{"location":"stage-level-scheduling/ResourceUtils/#addTaskResourceRequests","title":"Registering Task Resource Requests (from SparkConf)","text":"
                                                                                                                                                                                                                                                                                                                                    addTaskResourceRequests(\n  sparkConf: SparkConf,\n  treqs: TaskResourceRequests): Unit\n

                                                                                                                                                                                                                                                                                                                                    addTaskResourceRequests registers all task resource requests in the given SparkConf with the given TaskResourceRequests.

addTaskResourceRequests listResourceIds with the spark.task component name in the given SparkConf.

For every ResourceID discovered, addTaskResourceRequests does the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                    1. Finds all the settings with the confPrefix
                                                                                                                                                                                                                                                                                                                                    2. Looks up amount setting (or throws a SparkException)
                                                                                                                                                                                                                                                                                                                                    3. Registers the resourceName with the amount in the given TaskResourceRequests
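For illustration (the gpu resource name is an assumption, not from the source), a setting that follows the spark.task.resource.[resourceName].amount pattern is roughly equivalent to registering the request by hand:

// Submitting with e.g.\n//   ./bin/spark-shell -c spark.task.resource.gpu.amount=1\n// makes addTaskResourceRequests register a \"gpu\" request of amount 1,\n// which is roughly what the following does explicitly:\nimport org.apache.spark.resource.TaskResourceRequests\nval treqs = new TaskResourceRequests().resource(\"gpu\", 1)\nprintln(treqs)\n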

                                                                                                                                                                                                                                                                                                                                    addTaskResourceRequests is used when:

                                                                                                                                                                                                                                                                                                                                    • ResourceProfile is requested for the default task resource requests
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/#listResourceIds","title":"Listing All Configured Resources","text":"
                                                                                                                                                                                                                                                                                                                                    listResourceIds(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceID]\n

listResourceIds requests the given SparkConf for all Spark settings with keys that start with the following prefix:

                                                                                                                                                                                                                                                                                                                                    [componentName].resource.\n
                                                                                                                                                                                                                                                                                                                                    Internals

                                                                                                                                                                                                                                                                                                                                    listResourceIds gets resource-related settings (from SparkConf) with the prefix removed (e.g., spark.my_component.resource.gpu.amount becomes just gpu.amount).

                                                                                                                                                                                                                                                                                                                                    Example
                                                                                                                                                                                                                                                                                                                                    // Use the following to start spark-shell\n// ./bin/spark-shell -c spark.my_component.resource.gpu.amount=5\n\nval sparkConf = sc.getConf\n\n// Component names must start with `spark.` prefix\n// Spark assumes valid Spark settings start with `spark.` prefix\nval componentName = \"spark.my_component\"\n\n// this is copied verbatim from ResourceUtils.listResourceIds\n// Note that `resource` is hardcoded\nsparkConf.getAllWithPrefix(s\"$componentName.resource.\").foreach(println)\n\n// (gpu.amount,5)\n

                                                                                                                                                                                                                                                                                                                                    listResourceIds asserts that resource settings include a . (dot) to separate their resource names from configs or throws the following SparkException:

                                                                                                                                                                                                                                                                                                                                    You must specify an amount config for resource: [key] config: [componentName].resource.[key]\n
                                                                                                                                                                                                                                                                                                                                    SPARK-43947

                                                                                                                                                                                                                                                                                                                                    Although the exception says You must specify an amount config for resource, only the dot is checked.

                                                                                                                                                                                                                                                                                                                                    // Use the following to start spark-shell\n// 1. No amount config specified\n// 2. spark.driver is a Spark built-in resource\n// ./bin/spark-shell -c spark.driver.resource.gpu=5\n

                                                                                                                                                                                                                                                                                                                                    Reported as SPARK-43947.

In the end, listResourceIds creates a ResourceID for every resource (with the given componentName and the resource names discovered).
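For example (a hedged sketch; the component and resource names reuse the earlier spark-shell example):

import org.apache.spark.resource.ResourceID\n\n// spark.my_component.resource.gpu.amount=5 is expected to yield a ResourceID like:\nval id = new ResourceID(\"spark.my_component\", \"gpu\")\nprintln(id.componentName + \" -> \" + id.resourceName)\n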

                                                                                                                                                                                                                                                                                                                                    listResourceIds is used when:

                                                                                                                                                                                                                                                                                                                                    • ResourceUtils is requested to parseAllResourceRequests, addTaskResourceRequests, parseResourceRequirements, parseAllocatedOrDiscoverResources
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/#parseAllResourceRequests","title":"parseAllResourceRequests","text":"
                                                                                                                                                                                                                                                                                                                                    parseAllResourceRequests(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                    parseAllResourceRequests...FIXME

When | componentName
ResourceProfile | spark.executor
ResourceUtils |
KubernetesUtils (Spark on Kubernetes) |

                                                                                                                                                                                                                                                                                                                                    parseAllResourceRequests is used when:

                                                                                                                                                                                                                                                                                                                                    • ResourceProfile is requested for the default executor resource requests
                                                                                                                                                                                                                                                                                                                                    • ResourceUtils is requested to getOrDiscoverAllResources
                                                                                                                                                                                                                                                                                                                                    • KubernetesUtils (Spark on Kubernetes) is requested to buildResourcesQuantities
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/#getOrDiscoverAllResources","title":"getOrDiscoverAllResources","text":"
                                                                                                                                                                                                                                                                                                                                    getOrDiscoverAllResources(\n  sparkConf: SparkConf,\n  componentName: String,\n  resourcesFileOpt: Option[String]): Map[String, ResourceInformation]\n

                                                                                                                                                                                                                                                                                                                                    getOrDiscoverAllResources...FIXME

When | componentName | resourcesFileOpt
SparkContext | spark.driver | spark.driver.resourcesFile
Worker (Spark Standalone) | spark.worker | spark.worker.resourcesFile

                                                                                                                                                                                                                                                                                                                                    getOrDiscoverAllResources is used when:

                                                                                                                                                                                                                                                                                                                                    • SparkContext is created (and initializes _resources)
                                                                                                                                                                                                                                                                                                                                    • Worker (Spark Standalone) is requested to setupWorkerResources
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/#parseAllocatedOrDiscoverResources","title":"parseAllocatedOrDiscoverResources","text":"
                                                                                                                                                                                                                                                                                                                                    parseAllocatedOrDiscoverResources(\n  sparkConf: SparkConf,\n  componentName: String,\n  resourcesFileOpt: Option[String]): Seq[ResourceAllocation]\n

                                                                                                                                                                                                                                                                                                                                    parseAllocatedOrDiscoverResources...FIXME

                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/ResourceUtils/#parseResourceRequirements","title":"parseResourceRequirements (Spark Standalone)","text":"
                                                                                                                                                                                                                                                                                                                                    parseResourceRequirements(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceRequirement]\n

                                                                                                                                                                                                                                                                                                                                    parseResourceRequirements...FIXME

                                                                                                                                                                                                                                                                                                                                    componentName

componentName seems to always be spark.driver for the use cases, which seem to be Spark Standalone-only.

                                                                                                                                                                                                                                                                                                                                    parseResourceRequirements is used when:

                                                                                                                                                                                                                                                                                                                                    • ClientEndpoint (Spark Standalone) is requested to onStart
                                                                                                                                                                                                                                                                                                                                    • StandaloneSubmitRequestServlet (Spark Standalone) is requested to buildDriverDescription
                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/","title":"SparkListenerResourceProfileAdded","text":"

                                                                                                                                                                                                                                                                                                                                    SparkListenerResourceProfileAdded is a SparkListenerEvent.

                                                                                                                                                                                                                                                                                                                                    SparkListenerResourceProfileAdded can be intercepted using the following Spark listeners:

                                                                                                                                                                                                                                                                                                                                    • SparkFirehoseListener
                                                                                                                                                                                                                                                                                                                                    • SparkListenerInterface
                                                                                                                                                                                                                                                                                                                                    • SparkListener

                                                                                                                                                                                                                                                                                                                                    SparkListenerResourceProfileAdded is recorded using AppStatusListener for status reporting and monitoring.
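A minimal sketch (the listener class name is illustrative) of intercepting the event with a custom SparkListener registered on an active SparkContext (sc):

import org.apache.spark.scheduler.{SparkListener, SparkListenerResourceProfileAdded}\n\nclass ResourceProfileLogger extends SparkListener {\n  override def onResourceProfileAdded(event: SparkListenerResourceProfileAdded): Unit = {\n    println(s\"ResourceProfile added: ${event.resourceProfile.id}\")\n  }\n}\n\nsc.addSparkListener(new ResourceProfileLogger)\n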

                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                    SparkListenerResourceProfileAdded takes the following to be created:

                                                                                                                                                                                                                                                                                                                                    • ResourceProfile

                                                                                                                                                                                                                                                                                                                                      SparkListenerResourceProfileAdded is created when:

                                                                                                                                                                                                                                                                                                                                      • ResourceProfileManager is requested to register a new ResourceProfile
                                                                                                                                                                                                                                                                                                                                      • JsonProtocol (Spark History Server) is requested to resourceProfileAddedFromJson
                                                                                                                                                                                                                                                                                                                                      ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/#spark-history-server","title":"Spark History Server","text":"

                                                                                                                                                                                                                                                                                                                                      SparkListenerResourceProfileAdded is logged in Spark History Server using EventLoggingListener.

                                                                                                                                                                                                                                                                                                                                      SparkListenerResourceProfileAdded is converted from and to JSON format using JsonProtocol (resourceProfileAddedFromJson and resourceProfileAddedToJson, respectively).

                                                                                                                                                                                                                                                                                                                                      ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/TaskResourceProfile/","title":"TaskResourceProfile","text":"

                                                                                                                                                                                                                                                                                                                                      TaskResourceProfile is a ResourceProfile.
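A sketch of how a TaskResourceProfile typically comes about, assuming (per Creating Instance below) that ResourceProfileBuilder produces one when only task requirements are specified; the amount is illustrative:

import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

// Only task requirements, no executor requirements
val taskOnlyProfile = new ResourceProfileBuilder()
  .require(new TaskResourceRequests().cpus(4))
  .build

// Assumption: with no executor resources requested, build gives back a TaskResourceProfile
// println(taskOnlyProfile.getClass.getSimpleName)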

                                                                                                                                                                                                                                                                                                                                      "},{"location":"stage-level-scheduling/TaskResourceProfile/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                      TaskResourceProfile takes the following to be created:

                                                                                                                                                                                                                                                                                                                                      • Task Resources

                                                                                                                                                                                                                                                                                                                                        TaskResourceProfile is created when:

                                                                                                                                                                                                                                                                                                                                        • ResourceProfileBuilder is requested to build a ResourceProfile
                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is requested to merge ResourceProfiles
                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceProfile/#getCustomExecutorResources","title":"getCustomExecutorResources","text":"ResourceProfile
                                                                                                                                                                                                                                                                                                                                        getCustomExecutorResources(): Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                        getCustomExecutorResources is part of the ResourceProfile abstraction.

                                                                                                                                                                                                                                                                                                                                        getCustomExecutorResources...FIXME

                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequest/","title":"TaskResourceRequest","text":"

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequest is...FIXME

                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequests/","title":"TaskResourceRequests","text":"

TaskResourceRequests is a convenience API for working with multiple TaskResourceRequests (and hence the plural name \ud83d\ude09).

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequests can be defined as required using ResourceProfileBuilder.

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequests can be specified using configuration properties (using spark.task prefix).

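A minimal sketch of the typical usage in stage-level scheduling (the "gpu" resource name and all amounts are illustrative): TaskResourceRequests are combined with ExecutorResourceRequests through ResourceProfileBuilder and the resulting ResourceProfile is attached to an RDD.

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Illustrative amounts; "gpu" is an example custom resource name
val taskReqs = new TaskResourceRequests().cpus(2).resource("gpu", 1.0)
val execReqs = new ExecutorResourceRequests().cores(8).resource("gpu", 2L)

val profile = new ResourceProfileBuilder()
  .require(taskReqs)
  .require(execReqs)
  .build

// rdd.withResources(profile)  // request this profile for the stages that compute rdd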
Resource Name | Registerer
cpus | cpus
user-defined name | resource, addRequest
"},{"location":"stage-level-scheduling/TaskResourceRequests/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequests takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequests is created when:

                                                                                                                                                                                                                                                                                                                                        • ResourceProfile is requested for the default task resource requests
                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequests/#serializable","title":"Serializable","text":"

                                                                                                                                                                                                                                                                                                                                        TaskResourceRequests is Serializable (Java).

                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequests/#cpus","title":"cpus","text":"
                                                                                                                                                                                                                                                                                                                                        cpus(\n  amount: Int): this.type\n

cpus registers a TaskResourceRequest with the given amount (of CPU cores) in the _taskResources registry under the cpus resource name.

                                                                                                                                                                                                                                                                                                                                        Fluent API

cpus is part of the fluent API of TaskResourceRequests (and hence the strange-looking this.type return type).
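For example, the chained calls below operate on (and return) the same TaskResourceRequests instance; the amounts are illustrative:

import org.apache.spark.resource.TaskResourceRequests

// cpus and resource both return this.type, hence the chaining
val reqs = new TaskResourceRequests().cpus(2).resource("gpu", 0.5)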

                                                                                                                                                                                                                                                                                                                                        cpus is used when:

                                                                                                                                                                                                                                                                                                                                        • ResourceProfile is requested for the default task resource requests
                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequests/#_taskResources","title":"_taskResources","text":"
                                                                                                                                                                                                                                                                                                                                        _taskResources: ConcurrentHashMap[String, TaskResourceRequest]\n

_taskResources is a registry of TaskResourceRequests by resource name.

                                                                                                                                                                                                                                                                                                                                        _taskResources is available as requests.

                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/TaskResourceRequests/#requests","title":"requests","text":"
                                                                                                                                                                                                                                                                                                                                        requests: Map[String, TaskResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                        requests returns the _taskResources (converted to Scala).
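A quick illustration (the assertions reflect the behaviour described above):

import org.apache.spark.resource.TaskResourceRequests

val reqs = new TaskResourceRequests().cpus(2)
assert(reqs.requests.contains("cpus"))       // keyed by resource name
assert(reqs.requests("cpus").amount == 2.0)  // TaskResourceRequest.amount is a Double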

                                                                                                                                                                                                                                                                                                                                        requests is used when:

                                                                                                                                                                                                                                                                                                                                        • ResourceProfile is requested for the default task resource requests
                                                                                                                                                                                                                                                                                                                                        • ResourceProfileBuilder is requested to require
                                                                                                                                                                                                                                                                                                                                        • TaskResourceRequests is requested for the string representation
                                                                                                                                                                                                                                                                                                                                        "},{"location":"status/","title":"Status","text":"

The Status system uses AppStatusListener to write the state of a Spark application to AppStatusStore for reporting and monitoring:

                                                                                                                                                                                                                                                                                                                                        • web UI
                                                                                                                                                                                                                                                                                                                                        • REST API
                                                                                                                                                                                                                                                                                                                                        • Spark History Server
                                                                                                                                                                                                                                                                                                                                        • Metrics
                                                                                                                                                                                                                                                                                                                                        "},{"location":"status/AppStatusListener/","title":"AppStatusListener","text":"

AppStatusListener is a SparkListener that writes application state information to a data store.

"},{"location":"status/AppStatusListener/#event-handlers","title":"Event Handlers","text":"
Event Handler | LiveEntities
onJobStart | LiveJob, LiveStage, RDDOperationGraph
onStageSubmitted |
"},{"location":"status/AppStatusListener/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                          AppStatusListener takes the following to be created:

                                                                                                                                                                                                                                                                                                                                          • ElementTrackingStore
                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                          • live flag
                                                                                                                                                                                                                                                                                                                                          • AppStatusSource (default: None)
                                                                                                                                                                                                                                                                                                                                          • Last Update Time (default: None)

                                                                                                                                                                                                                                                                                                                                            AppStatusListener is created when:

• AppStatusStore is requested for an in-memory store for a running Spark application (with the live flag enabled)
                                                                                                                                                                                                                                                                                                                                            • FsHistoryProvider is requested to rebuildAppStore (with the live flag disabled)
                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/AppStatusListener/#elementtrackingstore","title":"ElementTrackingStore

                                                                                                                                                                                                                                                                                                                                            AppStatusListener is given an ElementTrackingStore when created.

                                                                                                                                                                                                                                                                                                                                            AppStatusListener registers triggers to clean up state in the store:

                                                                                                                                                                                                                                                                                                                                            • cleanupExecutors
                                                                                                                                                                                                                                                                                                                                            • cleanupJobs
                                                                                                                                                                                                                                                                                                                                            • cleanupStages

                                                                                                                                                                                                                                                                                                                                            ElementTrackingStore is used to write and...FIXME

                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusListener/#live-flag","title":"live Flag

                                                                                                                                                                                                                                                                                                                                            AppStatusListener is given a live flag when created.

The live flag indicates what the AppStatusListener is created for:

• true when created for an active (live) Spark application (for AppStatusStore)
                                                                                                                                                                                                                                                                                                                                            • false when created for Spark History Server (for FsHistoryProvider)
                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusListener/#updating-elementtrackingstore-for-active-spark-application","title":"Updating ElementTrackingStore for Active Spark Application
                                                                                                                                                                                                                                                                                                                                            liveUpdate(\n  entity: LiveEntity,\n  now: Long): Unit\n

liveUpdate updates the ElementTrackingStore when the live flag is enabled.
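In other words (a sketch only, not the actual implementation):

def liveUpdate(entity: LiveEntity, now: Long): Unit = {
  if (live) {       // only write for a live (active) application
    update(entity, now)
  }
}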

                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusListener/#updating-elementtrackingstore","title":"Updating ElementTrackingStore
                                                                                                                                                                                                                                                                                                                                            update(\n  entity: LiveEntity,\n  now: Long,\n  last: Boolean = false): Unit\n

update requests the given LiveEntity to write (to the ElementTrackingStore, with the checkTriggers flag set to the given last flag).

                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusListener/#getorcreateexecutor","title":"getOrCreateExecutor
                                                                                                                                                                                                                                                                                                                                            getOrCreateExecutor(\n  executorId: String,\n  addTime: Long): LiveExecutor\n

                                                                                                                                                                                                                                                                                                                                            getOrCreateExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                            getOrCreateExecutor\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is requested to onExecutorAdded and onBlockManagerAdded
                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusListener/#getorcreatestage","title":"getOrCreateStage
                                                                                                                                                                                                                                                                                                                                            getOrCreateStage(\n  info: StageInfo): LiveStage\n

                                                                                                                                                                                                                                                                                                                                            getOrCreateStage...FIXME

                                                                                                                                                                                                                                                                                                                                            getOrCreateStage\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is requested to onJobStart and onStageSubmitted
                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/AppStatusSource/","title":"AppStatusSource","text":"

                                                                                                                                                                                                                                                                                                                                            AppStatusSource is...FIXME

                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/AppStatusStore/","title":"AppStatusStore","text":"

                                                                                                                                                                                                                                                                                                                                            AppStatusStore stores the state of a Spark application in a data store (listening to state changes using AppStatusListener).

                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/AppStatusStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                            AppStatusStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                            • KVStore
                                                                                                                                                                                                                                                                                                                                            • AppStatusListener

                                                                                                                                                                                                                                                                                                                                              AppStatusStore is created\u00a0using createLiveStore utility.

                                                                                                                                                                                                                                                                                                                                              "},{"location":"status/AppStatusStore/#creating-in-memory-store-for-live-spark-application","title":"Creating In-Memory Store for Live Spark Application
                                                                                                                                                                                                                                                                                                                                              createLiveStore(\n  conf: SparkConf,\n  appStatusSource: Option[AppStatusSource] = None): AppStatusStore\n

                                                                                                                                                                                                                                                                                                                                              createLiveStore creates an ElementTrackingStore (with InMemoryStore and the SparkConf).

                                                                                                                                                                                                                                                                                                                                              createLiveStore creates an AppStatusListener (with the ElementTrackingStore, live flag on and the AppStatusSource).

In the end, createLiveStore creates an AppStatusStore (with the ElementTrackingStore and the AppStatusListener).

                                                                                                                                                                                                                                                                                                                                              createLiveStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • SparkContext is created
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#accessing-appstatusstore","title":"Accessing AppStatusStore

                                                                                                                                                                                                                                                                                                                                              AppStatusStore is available using SparkContext.

                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#sparkstatustracker","title":"SparkStatusTracker

                                                                                                                                                                                                                                                                                                                                              AppStatusStore is used to create SparkStatusTracker.
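SparkStatusTracker is the public way to read this data from a running application; for example (sc is an active SparkContext, variable names are illustrative):

val tracker = sc.statusTracker                 // org.apache.spark.SparkStatusTracker
val activeJobIds = tracker.getActiveJobIds     // Array[Int]
val activeStageIds = tracker.getActiveStageIds // Array[Int]
val executors = tracker.getExecutorInfos       // Array[SparkExecutorInfo]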

                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#sparkui","title":"SparkUI

                                                                                                                                                                                                                                                                                                                                              AppStatusStore is used to create SparkUI.

                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#rdds","title":"RDDs
                                                                                                                                                                                                                                                                                                                                              rddList(\n  cachedOnly: Boolean = true): Seq[v1.RDDStorageInfo]\n

                                                                                                                                                                                                                                                                                                                                              rddList requests the KVStore for (a view over) RDDStorageInfos (cached or not based on the given cachedOnly flag).

                                                                                                                                                                                                                                                                                                                                              rddList\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • AbstractApplicationResource is requested for the RDDs
                                                                                                                                                                                                                                                                                                                                              • StageTableBase is created (and renders a stage table for AllStagesPage, JobPage and PoolPage)
                                                                                                                                                                                                                                                                                                                                              • StoragePage is requested to render
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#streaming-blocks","title":"Streaming Blocks
                                                                                                                                                                                                                                                                                                                                              streamBlocksList(): Seq[StreamBlockData]\n

                                                                                                                                                                                                                                                                                                                                              streamBlocksList requests the KVStore for (a view over) StreamBlockDatas.

                                                                                                                                                                                                                                                                                                                                              streamBlocksList\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • StoragePage is requested to render
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#stages","title":"Stages
                                                                                                                                                                                                                                                                                                                                              stageList(\n  statuses: JList[v1.StageStatus]): Seq[v1.StageData]\n

                                                                                                                                                                                                                                                                                                                                              stageList requests the KVStore for (a view over) StageDatas.

                                                                                                                                                                                                                                                                                                                                              stageList\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • SparkStatusTracker is requested for active stage IDs
                                                                                                                                                                                                                                                                                                                                              • StagesResource is requested for stages
                                                                                                                                                                                                                                                                                                                                              • AllStagesPage is requested to render
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#jobs","title":"Jobs
                                                                                                                                                                                                                                                                                                                                              jobsList(\n  statuses: JList[JobExecutionStatus]): Seq[v1.JobData]\n

                                                                                                                                                                                                                                                                                                                                              jobsList requests the KVStore for (a view over) JobDatas.

                                                                                                                                                                                                                                                                                                                                              jobsList\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • SparkStatusTracker is requested for getJobIdsForGroup and getActiveJobIds
                                                                                                                                                                                                                                                                                                                                              • AbstractApplicationResource is requested for jobs
                                                                                                                                                                                                                                                                                                                                              • AllJobsPage is requested to render
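
In user code, the same data is reachable through the public SparkStatusTracker (which, as listed above, relies on jobsList for getActiveJobIds and getJobIdsForGroup). A short example, assuming an active SparkContext sc:

val tracker = sc.statusTracker
val activeJobIds = tracker.getActiveJobIds()   // backed by jobsList
activeJobIds.flatMap(tracker.getJobInfo).foreach { job =>
  println(s"job ${job.jobId()}: ${job.status()}")
}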
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#executors","title":"Executors
                                                                                                                                                                                                                                                                                                                                              executorList(\n  activeOnly: Boolean): Seq[v1.ExecutorSummary]\n

                                                                                                                                                                                                                                                                                                                                              executorList requests the KVStore for (a view over) ExecutorSummarys.

                                                                                                                                                                                                                                                                                                                                              executorList\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • FIXME
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/AppStatusStore/#application-summary","title":"Application Summary
                                                                                                                                                                                                                                                                                                                                              appSummary(): AppSummary\n

                                                                                                                                                                                                                                                                                                                                              appSummary requests the KVStore to read the AppSummary.

                                                                                                                                                                                                                                                                                                                                              appSummary\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                              • AllJobsPage is requested to render
                                                                                                                                                                                                                                                                                                                                              • AllStagesPage is requested to render
                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"status/ElementTrackingStore/","title":"ElementTrackingStore","text":"

                                                                                                                                                                                                                                                                                                                                              ElementTrackingStore is a KVStore that tracks the number of entities (elements) of specific types in a store and triggers actions once they reach a threshold.

                                                                                                                                                                                                                                                                                                                                              "},{"location":"status/ElementTrackingStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                              ElementTrackingStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                              • KVStore
                                                                                                                                                                                                                                                                                                                                              • SparkConf

                                                                                                                                                                                                                                                                                                                                                ElementTrackingStore is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                • AppStatusStore is requested to createLiveStore
                                                                                                                                                                                                                                                                                                                                                • FsHistoryProvider is requested to rebuildAppStore
                                                                                                                                                                                                                                                                                                                                                "},{"location":"status/ElementTrackingStore/#writing-value-to-store","title":"Writing Value to Store
                                                                                                                                                                                                                                                                                                                                                write(\n  value: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                write\u00a0is part of the KVStore abstraction.

write requests the KVStore to write the value.

                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"status/ElementTrackingStore/#writing-value-to-store-and-checking-triggers","title":"Writing Value to Store and Checking Triggers
                                                                                                                                                                                                                                                                                                                                                write(\n  value: Any,\n  checkTriggers: Boolean): WriteQueueResult\n

                                                                                                                                                                                                                                                                                                                                                write writes the value.

                                                                                                                                                                                                                                                                                                                                                write...FIXME

                                                                                                                                                                                                                                                                                                                                                write is used when:

                                                                                                                                                                                                                                                                                                                                                • LiveEntity is requested to write
                                                                                                                                                                                                                                                                                                                                                • StreamingQueryStatusListener (Spark Structured Streaming) is requested to onQueryStarted and onQueryTerminated
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"status/ElementTrackingStore/#creating-view-of-specific-entities","title":"Creating View of Specific Entities
                                                                                                                                                                                                                                                                                                                                                view[T](\n  klass: Class[T]): KVStoreView[T]\n

                                                                                                                                                                                                                                                                                                                                                view\u00a0is part of the KVStore abstraction.

                                                                                                                                                                                                                                                                                                                                                view requests the KVStore for a view of klass entities.
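
A minimal sketch, assuming store is an ElementTrackingStore and the code can access the private[spark] org.apache.spark.status classes (JobDataWrapper is the entity type AppStatusListener writes for jobs):

// KVStoreView implements java.lang.Iterable, so the view can be iterated directly
import scala.jdk.CollectionConverters._   // Scala 2.13; scala.collection.JavaConverters on 2.12
import org.apache.spark.status.JobDataWrapper

val jobs = store.view(classOf[JobDataWrapper]).asScala.toSeq
println(s"${jobs.size} job(s) tracked in the store")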

                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"status/ElementTrackingStore/#registering-trigger","title":"Registering Trigger
                                                                                                                                                                                                                                                                                                                                                addTrigger(\n  klass: Class[_],\n  threshold: Long)(\n  action: Long => Unit): Unit\n

                                                                                                                                                                                                                                                                                                                                                addTrigger...FIXME

                                                                                                                                                                                                                                                                                                                                                addTrigger is used when:

                                                                                                                                                                                                                                                                                                                                                • AppStatusListener is created
                                                                                                                                                                                                                                                                                                                                                • HiveThriftServer2Listener (Spark Thrift Server) is created
                                                                                                                                                                                                                                                                                                                                                • SQLAppStatusListener (Spark SQL) is created
                                                                                                                                                                                                                                                                                                                                                • StreamingQueryStatusListener (Spark Structured Streaming) is created
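
As a sketch of what such a registration looks like, the snippet below mirrors the kind of trigger AppStatusListener installs for jobs; kvstore, maxRetainedJobs and cleanupJobs are illustrative names, not the actual ones:

import org.apache.spark.status.JobDataWrapper

// once more than maxRetainedJobs JobDataWrapper entities are in the store,
// the registered action is invoked with the current entity count
kvstore.addTrigger(classOf[JobDataWrapper], maxRetainedJobs) { count =>
  cleanupJobs(count)
}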
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"status/LiveEntity/","title":"LiveEntity","text":"

                                                                                                                                                                                                                                                                                                                                                LiveEntity is an abstraction of entities of a running (live) Spark application.

                                                                                                                                                                                                                                                                                                                                                "},{"location":"status/LiveEntity/#contract","title":"Contract","text":""},{"location":"status/LiveEntity/#doupdate","title":"doUpdate
                                                                                                                                                                                                                                                                                                                                                doUpdate(): Any\n

                                                                                                                                                                                                                                                                                                                                                Updated view of this entity's data

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • LiveEntity is requested to write out to the store
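
A hypothetical implementation, only to illustrate the contract (GreetingSummary and LiveGreeting are made up; real implementations live in the private[spark] org.apache.spark.status package):

// the immutable snapshot that ends up in the KVStore
private class GreetingSummary(val name: String, val count: Int)

private class LiveGreeting(name: String) extends LiveEntity {
  var count = 0
  // doUpdate returns a fresh, immutable view of the current (mutable) state;
  // write stores whatever doUpdate returns in the ElementTrackingStore
  override protected def doUpdate(): Any = new GreetingSummary(name, count)
}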
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"status/LiveEntity/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                • LiveExecutionData (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                • LiveExecutionData (Spark Thrift Server)
                                                                                                                                                                                                                                                                                                                                                • LiveExecutor
                                                                                                                                                                                                                                                                                                                                                • LiveExecutorStageSummary
                                                                                                                                                                                                                                                                                                                                                • LiveJob
                                                                                                                                                                                                                                                                                                                                                • LiveRDD
                                                                                                                                                                                                                                                                                                                                                • LiveResourceProfile
                                                                                                                                                                                                                                                                                                                                                • LiveSessionData
                                                                                                                                                                                                                                                                                                                                                • LiveStage
                                                                                                                                                                                                                                                                                                                                                • LiveTask
                                                                                                                                                                                                                                                                                                                                                • SchedulerPool
                                                                                                                                                                                                                                                                                                                                                "},{"location":"status/LiveEntity/#writing-out-to-store","title":"Writing Out to Store
                                                                                                                                                                                                                                                                                                                                                write(\n  store: ElementTrackingStore,\n  now: Long,\n  checkTriggers: Boolean = false): Unit\n

                                                                                                                                                                                                                                                                                                                                                write...FIXME

                                                                                                                                                                                                                                                                                                                                                write\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                • AppStatusListener is requested to update
                                                                                                                                                                                                                                                                                                                                                • HiveThriftServer2Listener (Spark Thrift Server) is requested to updateStoreWithTriggerEnabled and updateLiveStore
                                                                                                                                                                                                                                                                                                                                                • SQLAppStatusListener (Spark SQL) is requested to update
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/","title":"Storage System","text":"

                                                                                                                                                                                                                                                                                                                                                Storage System is a core component of Apache Spark that uses BlockManager to manage blocks in memory and on disk (based on StorageLevel).

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockData/","title":"BlockData","text":"

                                                                                                                                                                                                                                                                                                                                                BlockData is...FIXME

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockDataManager/","title":"BlockDataManager","text":"

                                                                                                                                                                                                                                                                                                                                                BlockDataManager is an abstraction of block data managers that manage storage for blocks of data (aka block storage management API).

                                                                                                                                                                                                                                                                                                                                                BlockDataManager uses BlockId to uniquely identify blocks of data and ManagedBuffer to represent them.

BlockDataManager is used to initialize a BlockTransferService and to create a NettyBlockRpcServer.

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockDataManager/#contract","title":"Contract","text":""},{"location":"storage/BlockDataManager/#diagnoseshuffleblockcorruption","title":"diagnoseShuffleBlockCorruption
                                                                                                                                                                                                                                                                                                                                                diagnoseShuffleBlockCorruption(\n  blockId: BlockId,\n  checksumByReader: Long,\n  algorithm: String): Cause\n
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#gethostlocalshuffledata","title":"getHostLocalShuffleData
                                                                                                                                                                                                                                                                                                                                                getHostLocalShuffleData(\n  blockId: BlockId,\n  dirs: Array[String]): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • ShuffleBlockFetcherIterator is requested to fetchHostLocalBlock
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#getlocalblockdata","title":"getLocalBlockData
                                                                                                                                                                                                                                                                                                                                                getLocalBlockData(\n  blockId: BlockId): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • NettyBlockRpcServer is requested to receive a request (OpenBlocks and FetchShuffleBlocks)
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#getlocaldiskdirs","title":"getLocalDiskDirs
                                                                                                                                                                                                                                                                                                                                                getLocalDiskDirs: Array[String]\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • NettyBlockRpcServer is requested to handle a GetLocalDirsForExecutors request
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#putblockdata","title":"putBlockData
                                                                                                                                                                                                                                                                                                                                                putBlockData(\n  blockId: BlockId,\n  data: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Boolean\n

Stores (puts) block data (as a ManagedBuffer) for the given BlockId. Returns true when completed successfully or false otherwise.

                                                                                                                                                                                                                                                                                                                                                Used when:

• NettyBlockRpcServer is requested to receive an UploadBlock request
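
A hedged sketch of a direct call, assuming blockManager is the local BlockManager (the only BlockDataManager implementation) and the code runs inside the org.apache.spark namespace:

import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.network.buffer.NioManagedBuffer
import org.apache.spark.storage.{StorageLevel, TestBlockId}

val payload = new NioManagedBuffer(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
// returns true when the block was stored successfully
val stored = blockManager.putBlockData(
  TestBlockId("demo"), payload, StorageLevel.MEMORY_ONLY, ClassTag.Byte)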
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#putblockdataasstream","title":"putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                putBlockDataAsStream(\n  blockId: BlockId,\n  level: StorageLevel,\n  classTag: ClassTag[_]): StreamCallbackWithID\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • NettyBlockRpcServer is requested to receiveStream
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#releaselock","title":"releaseLock
                                                                                                                                                                                                                                                                                                                                                releaseLock(\n  blockId: BlockId,\n  taskContext: Option[TaskContext]): Unit\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • TorrentBroadcast is requested to releaseBlockManagerLock
                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to handleLocalReadFailure, getLocalValues, getOrElseUpdate, doPut, releaseLockAndDispose
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockDataManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                • BlockManager
                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockEvictionHandler/","title":"BlockEvictionHandler","text":"

                                                                                                                                                                                                                                                                                                                                                BlockEvictionHandler is an abstraction of block eviction handlers that can drop blocks from memory.

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockEvictionHandler/#contract","title":"Contract","text":""},{"location":"storage/BlockEvictionHandler/#dropping-block-from-memory","title":"Dropping Block from Memory
                                                                                                                                                                                                                                                                                                                                                dropFromMemory[T: ClassTag](\n  blockId: BlockId,\n  data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel\n

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested to evict blocks
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockEvictionHandler/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                • BlockManager
                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockId/","title":"BlockId","text":"

BlockId is an abstraction of data block identifiers based on a unique name.

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockId/#contract","title":"Contract","text":""},{"location":"storage/BlockId/#name","title":"Name
                                                                                                                                                                                                                                                                                                                                                name: String\n

                                                                                                                                                                                                                                                                                                                                                A globally unique identifier of this Block

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to putBlockDataAsStream and readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                • UpdateBlockInfo is requested to writeExternal
                                                                                                                                                                                                                                                                                                                                                • DiskBlockManager is requested to getFile and containsBlock
                                                                                                                                                                                                                                                                                                                                                • DiskStore is requested to getBytes, remove, moveFileToBlock, contains
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockId/#implementations","title":"Implementations","text":"Sealed Abstract Class

BlockId is a Scala sealed abstract class, which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockId/#broadcastblockid","title":"BroadcastBlockId

                                                                                                                                                                                                                                                                                                                                                BlockId for broadcast variable blocks:

                                                                                                                                                                                                                                                                                                                                                • broadcastId identifier
                                                                                                                                                                                                                                                                                                                                                • Optional field name (default: empty)

                                                                                                                                                                                                                                                                                                                                                Uses broadcast_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • TorrentBroadcast is created, requested to store a broadcast and the blocks in a local BlockManager, and read blocks
                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to remove all the blocks of a broadcast variable
                                                                                                                                                                                                                                                                                                                                                • SerializerManager is requested to shouldCompress
                                                                                                                                                                                                                                                                                                                                                • AppStatusListener is requested to onBlockUpdated
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockId/#rddblockid","title":"RDDBlockId

                                                                                                                                                                                                                                                                                                                                                BlockId for RDD partitions:

                                                                                                                                                                                                                                                                                                                                                • rddId identifier
                                                                                                                                                                                                                                                                                                                                                • splitIndex identifier

                                                                                                                                                                                                                                                                                                                                                Uses rdd_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • StorageStatus is requested to register the status of a data block, get the status of a data block, updateStorageInfo
                                                                                                                                                                                                                                                                                                                                                • LocalRDDCheckpointData is requested to doCheckpoint
                                                                                                                                                                                                                                                                                                                                                • RDD is requested to getOrCompute
                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested for the BlockManagers (executors) for cached RDD partitions
                                                                                                                                                                                                                                                                                                                                                • BlockManagerMasterEndpoint is requested to removeRdd
                                                                                                                                                                                                                                                                                                                                                • AppStatusListener is requested to updateRDDBlock (when onBlockUpdated for an RDDBlockId)

                                                                                                                                                                                                                                                                                                                                                Compressed when spark.rdd.compress configuration property is enabled
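For illustration (assuming spark-core is on the classpath), the rdd_ prefix and the compression setting in code; spark.rdd.compress defaults to false:

import org.apache.spark.SparkConf
import org.apache.spark.storage.RDDBlockId

// name is "rdd_[rddId]_[splitIndex]"
println(RDDBlockId(0, 1).name) // rdd_0_1

// RDD blocks are compressed only when spark.rdd.compress is enabled (off by default)
val conf = new SparkConf().set("spark.rdd.compress", "true")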

                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockId/#shuffleblockbatchid","title":"ShuffleBlockBatchId","text":""},{"location":"storage/BlockId/#shuffleblockid","title":"ShuffleBlockId

                                                                                                                                                                                                                                                                                                                                                BlockId for shuffle blocks:

                                                                                                                                                                                                                                                                                                                                                • shuffleId identifier
                                                                                                                                                                                                                                                                                                                                                • mapId identifier
                                                                                                                                                                                                                                                                                                                                                • reduceId identifier

                                                                                                                                                                                                                                                                                                                                                Uses shuffle_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                • ShuffleBlockFetcherIterator is requested to throwFetchFailedException
                                                                                                                                                                                                                                                                                                                                                • MapOutputTracker utility is requested to convertMapStatuses
                                                                                                                                                                                                                                                                                                                                                • NettyBlockRpcServer is requested to handle a FetchShuffleBlocks message
                                                                                                                                                                                                                                                                                                                                                • ExternalSorter is requested to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                • ShuffleBlockFetcherIterator is requested to mergeContinuousShuffleBlockIdsIfNeeded
                                                                                                                                                                                                                                                                                                                                                • IndexShuffleBlockResolver is requested to getBlockData

                                                                                                                                                                                                                                                                                                                                                Compressed when spark.shuffle.compress configuration property is enabled
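For illustration (assuming spark-core is on the classpath), the shuffle_ prefix and the compression setting in code; spark.shuffle.compress defaults to true:

import org.apache.spark.SparkConf
import org.apache.spark.storage.ShuffleBlockId

// name is "shuffle_[shuffleId]_[mapId]_[reduceId]"
println(ShuffleBlockId(0, 0, 0).name) // shuffle_0_0_0

// shuffle blocks are compressed unless spark.shuffle.compress is disabled (on by default)
val conf = new SparkConf().set("spark.shuffle.compress", "false")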

                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockId/#shuffledatablockid","title":"ShuffleDataBlockId","text":""},{"location":"storage/BlockId/#shuffleindexblockid","title":"ShuffleIndexBlockId","text":""},{"location":"storage/BlockId/#streamblockid","title":"StreamBlockId

                                                                                                                                                                                                                                                                                                                                                BlockId for ...FIXME:

                                                                                                                                                                                                                                                                                                                                                • streamId
                                                                                                                                                                                                                                                                                                                                                • uniqueId

                                                                                                                                                                                                                                                                                                                                                Uses the following name:

                                                                                                                                                                                                                                                                                                                                                input-[streamId]-[uniqueId]\n

                                                                                                                                                                                                                                                                                                                                                Used in Spark Streaming
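For illustration (assuming spark-core is on the classpath), the name format above in code:

import org.apache.spark.storage.StreamBlockId

// name is "input-[streamId]-[uniqueId]"
println(StreamBlockId(0, 1000L).name) // input-0-1000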

                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockId/#taskresultblockid","title":"TaskResultBlockId","text":""},{"location":"storage/BlockId/#templocalblockid","title":"TempLocalBlockId","text":""},{"location":"storage/BlockId/#tempshuffleblockid","title":"TempShuffleBlockId","text":""},{"location":"storage/BlockId/#testblockid","title":"TestBlockId","text":""},{"location":"storage/BlockId/#creating-blockid-by-name","title":"Creating BlockId by Name
                                                                                                                                                                                                                                                                                                                                                apply(\n  name: String): BlockId\n

apply creates one of the available BlockIds for the given name (the prefix of the name determines the concrete BlockId).
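For illustration (assuming spark-core is on the classpath), a round trip between a name and its BlockId; isBroadcast is one of the convenience predicates on BlockId:

import org.apache.spark.storage.{BlockId, RDDBlockId}

// the prefix of the name selects the concrete BlockId
assert(BlockId("rdd_0_1") == RDDBlockId(0, 1))
assert(BlockId("broadcast_0").isBroadcast)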

                                                                                                                                                                                                                                                                                                                                                apply is used when:

                                                                                                                                                                                                                                                                                                                                                • NettyBlockRpcServer is requested to handle OpenBlocks, UploadBlock messages and receiveStream
                                                                                                                                                                                                                                                                                                                                                • UpdateBlockInfo is requested to deserialize (readExternal)
                                                                                                                                                                                                                                                                                                                                                • DiskBlockManager is requested for all the blocks (from files stored on disk)
                                                                                                                                                                                                                                                                                                                                                • ShuffleBlockFetcherIterator is requested to sendRequest
                                                                                                                                                                                                                                                                                                                                                • JsonProtocol utility is used to accumValueFromJson, taskMetricsFromJson and blockUpdatedInfoFromJson
                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfo/","title":"BlockInfo","text":"

BlockInfo is the metadata of a data block (stored in MemoryStore or DiskStore).

                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                BlockInfo takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                • StorageLevel
                                                                                                                                                                                                                                                                                                                                                • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                • tellMaster flag

BlockInfo is created when:

                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to doPut
                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockInfo/#block-size","title":"Block Size

                                                                                                                                                                                                                                                                                                                                                  BlockInfo knows the size of the block (in bytes).

                                                                                                                                                                                                                                                                                                                                                  The size is 0 by default and changes when:

                                                                                                                                                                                                                                                                                                                                                  • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to doPutIterator
                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockInfo/#reader-count","title":"Reader Count

readerCount is the number of times that this block has been locked for reading.

                                                                                                                                                                                                                                                                                                                                                  readerCount is 0 by default.

                                                                                                                                                                                                                                                                                                                                                  readerCount changes back to 0 when:

                                                                                                                                                                                                                                                                                                                                                  • BlockInfoManager is requested to remove a block and clear

readerCount is incremented when a read lock is acquired and decremented when the following happens:

                                                                                                                                                                                                                                                                                                                                                  • BlockInfoManager is requested to release a lock and releaseAllLocksForTask
                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockInfo/#writer-task","title":"Writer Task

writerTask attribute is the ID of the task that holds the write lock for the block, or one of the following special values (summarized in the sketch after this section):

                                                                                                                                                                                                                                                                                                                                                  • -1 for no writers and hence no write lock in use
                                                                                                                                                                                                                                                                                                                                                  • -1024 for non-task threads (by a driver thread or by unit test code)

                                                                                                                                                                                                                                                                                                                                                    writerTask is assigned a task ID when:

                                                                                                                                                                                                                                                                                                                                                    • BlockInfoManager is requested to lockForWriting, unlock, releaseAllLocksForTask, removeBlock, clear
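For illustration only, a minimal Scala sketch of the per-block state described in this and the preceding sections (size, readerCount, writerTask); BlockInfoSketch and the level field are hypothetical stand-ins, not Spark's actual BlockInfo:

// Simplified, illustrative model of the block metadata described above (not Spark's code)
object BlockInfoSketch {
  val NO_WRITER = -1L          // no write lock in use
  val NON_TASK_WRITER = -1024L // write lock held by a non-task thread (driver or test code)
}

class BlockInfoSketch(
    val level: String,         // hypothetical stand-in for StorageLevel
    val tellMaster: Boolean) {
  var size: Long = 0L          // 0 until the block is actually stored
  var readerCount: Int = 0     // number of read locks currently held
  var writerTask: Long = BlockInfoSketch.NO_WRITER
}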
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/","title":"BlockInfoManager","text":"

                                                                                                                                                                                                                                                                                                                                                    BlockInfoManager is used by BlockManager (and MemoryStore) to manage metadata of memory blocks and control concurrent access by locks for reading and writing.

                                                                                                                                                                                                                                                                                                                                                    BlockInfoManager is used to create a MemoryStore and a BlockManagerManagedBuffer.

                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockInfoManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                    BlockInfoManager takes no arguments to be created.

BlockInfoManager is created for BlockManager.

                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockInfoManager/#block-metadata","title":"Block Metadata
                                                                                                                                                                                                                                                                                                                                                    infos: HashMap[BlockId, BlockInfo]\n

BlockInfoManager uses a registry of block metadata (BlockInfo), one entry per block (keyed by BlockId).

                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#locks","title":"Locks

                                                                                                                                                                                                                                                                                                                                                    Locks are the mechanism to control concurrent access to data and prevent destructive interaction between operations that use the same resource.

                                                                                                                                                                                                                                                                                                                                                    BlockInfoManager uses read and write locks by task attempts.

                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#read-locks","title":"Read Locks
                                                                                                                                                                                                                                                                                                                                                    readLocksByTask: HashMap[TaskAttemptId, ConcurrentHashMultiset[BlockId]]\n

                                                                                                                                                                                                                                                                                                                                                    BlockInfoManager uses readLocksByTask registry to track tasks (by TaskAttemptId) and the blocks they locked for reading (as BlockIds).

                                                                                                                                                                                                                                                                                                                                                    A new entry is added when BlockInfoManager is requested to register a task (attempt).

                                                                                                                                                                                                                                                                                                                                                    A new BlockId is added to an existing task attempt in lockForReading.
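As a rough sketch of this bookkeeping (assuming Guava's ConcurrentHashMultiset, which Spark bundles, and a plain String standing in for BlockId), where the multiset allows the same task to hold the same read lock more than once:

import scala.collection.mutable
import com.google.common.collect.ConcurrentHashMultiset

// task attempt id -> block names it holds read locks on (illustrative sketch only)
val readLocksByTask = mutable.HashMap.empty[Long, ConcurrentHashMultiset[String]]

readLocksByTask(7L) = ConcurrentHashMultiset.create[String]()  // registerTask(7)
readLocksByTask(7L).add("rdd_0_1")                             // recorded by lockForReading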

                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#write-locks","title":"Write Locks

BlockInfoManager uses the writeLocksByTask registry to track tasks (by TaskAttemptId) and the blocks they locked for writing (as BlockIds).

                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#registering-task-execution-attempt","title":"Registering Task (Execution Attempt)
                                                                                                                                                                                                                                                                                                                                                    registerTask(\n  taskAttemptId: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                    registerTask registers a new \"empty\" entry for the given task (by the task attempt ID) to the readLocksByTask internal registry.

                                                                                                                                                                                                                                                                                                                                                    registerTask is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockInfoManager is created
                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to registerTask
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#downgrading-exclusive-write-lock-to-shared-read-lock","title":"Downgrading Exclusive Write Lock to Shared Read Lock
                                                                                                                                                                                                                                                                                                                                                    downgradeLock(\n  blockId: BlockId): Unit\n

                                                                                                                                                                                                                                                                                                                                                    downgradeLock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] downgrading write lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                    downgradeLock...FIXME

                                                                                                                                                                                                                                                                                                                                                    downgradeLock is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to doPut and downgradeLock
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#obtaining-read-lock-for-block","title":"Obtaining Read Lock for Block
                                                                                                                                                                                                                                                                                                                                                    lockForReading(\n  blockId: BlockId,\n  blocking: Boolean = true): Option[BlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                    lockForReading locks a given memory block for reading when the block was registered earlier and no writer tasks use it.

                                                                                                                                                                                                                                                                                                                                                    When executed, lockForReading prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] trying to acquire read lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                    lockForReading looks up the metadata of the blockId block (in the infos registry).

                                                                                                                                                                                                                                                                                                                                                    If no metadata could be found, lockForReading returns None which means that the block does not exist or was removed (and anybody could acquire a write lock).

Otherwise, when the metadata was found (i.e. the block is registered), lockForReading checks the so-called writerTask. A read lock can be acquired only when the block has no writer task. If so, the readerCount of the block metadata is incremented and the block is recorded (in the internal readLocksByTask registry). lockForReading prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] acquired read lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                    The BlockInfo for the blockId block is returned.

                                                                                                                                                                                                                                                                                                                                                    Note

-1024 is a special taskAttemptId (NON_TASK_WRITER) used to mark a non-task thread, e.g. a driver thread or unit test code.

For blocks with writerTask other than NO_WRITER, when blocking is enabled, lockForReading waits (until another thread invokes Object.notify or Object.notifyAll on this object).

With blocking enabled, lockForReading repeats the waiting-for-read-lock sequence until it either returns None or acquires the lock.

                                                                                                                                                                                                                                                                                                                                                    When blocking is disabled and the lock could not be obtained, None is returned immediately.
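To make the control flow concrete, here is an illustrative, heavily simplified sketch of the read-lock loop (a String-keyed map standing in for infos); this is not Spark's actual implementation and it omits the readLocksByTask bookkeeping:

import scala.collection.mutable

class Info { var readerCount = 0; var writerTask = -1L }  // -1L == NO_WRITER

class LockManagerSketch {
  private val infos = mutable.HashMap.empty[String, Info]

  def lockForReading(blockId: String, blocking: Boolean = true): Option[Info] =
    synchronized {
      var result: Option[Info] = None
      var done = false
      while (!done) {
        infos.get(blockId) match {
          case None => done = true                  // block unknown or removed: give up with None
          case Some(info) if info.writerTask == -1L =>
            info.readerCount += 1                   // no writer: acquire the read lock
            result = Some(info); done = true
          case _ if !blocking => done = true        // non-blocking: return None immediately
          case _ => wait()                          // a writer holds the block: wait for notify/notifyAll
        }
      }
      result
    }
}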

                                                                                                                                                                                                                                                                                                                                                    Note

lockForReading is a synchronized method, i.e. no two threads can execute it (or the other synchronized instance methods) concurrently on the same object.

                                                                                                                                                                                                                                                                                                                                                    lockForReading is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockInfoManager is requested to downgradeLock and lockNewBlockForWriting
                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to getLocalValues, getLocalBytes and replicateBlock
                                                                                                                                                                                                                                                                                                                                                    • BlockManagerManagedBuffer is requested to retain
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#obtaining-write-lock-for-block","title":"Obtaining Write Lock for Block
                                                                                                                                                                                                                                                                                                                                                    lockForWriting(\n  blockId: BlockId,\n  blocking: Boolean = true): Option[BlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                    lockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] trying to acquire write lock for [blockId]\n

lockForWriting looks up the blockId (in the infos registry). When no BlockInfo could be found, None is returned. Otherwise, the blockId block is checked for writerTask to be BlockInfo.NO_WRITER with no readers (i.e. readerCount is 0), and only then is the lock returned.

                                                                                                                                                                                                                                                                                                                                                    When the write lock can be returned, BlockInfo.writerTask is set to currentTaskAttemptId and a new binding is added to the internal writeLocksByTask registry. lockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] acquired write lock for [blockId]\n

If the blockId block has a writer or the number of readers is positive (i.e. BlockInfo.readerCount is greater than 0), the method waits (based on the input blocking flag) and retries the write-lock acquisition until it finishes with a write lock.

Note (deadlock possible): the method is synchronized and can block, i.e. it calls wait, which makes the current thread wait until another thread invokes Object.notify or Object.notifyAll on this object.

lockForWriting returns None when blockId is not in the internal infos registry, or when the blocking flag is disabled and the write lock could not be acquired.
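Extending the read-lock sketch above (illustrative only; the taskId parameter is a simplification standing in for the current task attempt ID), the write-lock variant adds the no-readers check and records the owning task:

// Add to LockManagerSketch from the read-lock sketch above (illustrative only)
def lockForWriting(blockId: String, taskId: Long, blocking: Boolean = true): Option[Info] =
  synchronized {
    var result: Option[Info] = None
    var done = false
    while (!done) {
      infos.get(blockId) match {
        case None => done = true                                    // unknown block: None
        case Some(info) if info.writerTask == -1L && info.readerCount == 0 =>
          info.writerTask = taskId                                  // exclusive write lock acquired
          result = Some(info); done = true
        case _ if !blocking => done = true                          // non-blocking: give up
        case _ => wait()                                            // readers or a writer present: wait
      }
    }
    result
  }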

                                                                                                                                                                                                                                                                                                                                                    lockForWriting is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockInfoManager is requested to lockNewBlockForWriting
                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to removeBlock
                                                                                                                                                                                                                                                                                                                                                    • MemoryStore is requested to evictBlocksToFreeSpace
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#obtaining-write-lock-for-new-block","title":"Obtaining Write Lock for New Block
                                                                                                                                                                                                                                                                                                                                                    lockNewBlockForWriting(\n  blockId: BlockId,\n  newBlockInfo: BlockInfo): Boolean\n

lockNewBlockForWriting obtains a write lock for the given blockId, but only when it can register the block (i.e. no other thread has registered it yet).

                                                                                                                                                                                                                                                                                                                                                    Note

lockNewBlockForWriting is similar to the lockForWriting method, but is meant for brand new blocks.

                                                                                                                                                                                                                                                                                                                                                    When executed, lockNewBlockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] trying to put [blockId]\n

If some other thread has already created the block, lockNewBlockForWriting finishes, returning false. Otherwise, when the block does not exist yet, newBlockInfo is recorded in the infos internal registry and the block is locked for writing for this client. lockNewBlockForWriting then returns true.

                                                                                                                                                                                                                                                                                                                                                    Note

lockNewBlockForWriting executes in a synchronized block, so once the BlockInfoManager is locked, the other internal registries are available to the current thread only.
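A simplified sketch of the check-then-register step follows (an assumed illustration of the description above, not the actual source; the registry names are as in the text):

// Register the block only if absent; otherwise report that another writer won the race.
def lockNewBlockForWriting(blockId: BlockId, newBlockInfo: BlockInfo): Boolean = synchronized {
  logTrace(s"Task $currentTaskAttemptId trying to put $blockId")
  if (infos.contains(blockId)) {
    false                                               // some other thread created the block first
  } else {
    infos(blockId) = newBlockInfo                       // register the brand new block
    newBlockInfo.writerTask = currentTaskAttemptId      // and lock it for writing
    writeLocksByTask.addBinding(currentTaskAttemptId, blockId)
    true
  }
}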

                                                                                                                                                                                                                                                                                                                                                    lockNewBlockForWriting is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to doPut
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#releasing-lock-on-block","title":"Releasing Lock on Block
                                                                                                                                                                                                                                                                                                                                                    unlock(\n  blockId: BlockId,\n  taskAttemptId: Option[TaskAttemptId] = None): Unit\n

                                                                                                                                                                                                                                                                                                                                                    unlock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                    Task [currentTaskAttemptId] releasing lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                    unlock gets the metadata for blockId (and throws an IllegalStateException if the block was not found).

If the writer task for the block is not NO_WRITER, it is reset to NO_WRITER and the blockId block is removed from the internal writeLocksByTask registry for the current task attempt.

Otherwise, if the writer task is indeed NO_WRITER, the block is assumed to be locked for reading. The readerCount counter is decremented for the blockId block and the read lock is removed from the internal readLocksByTask registry for the task attempt.

                                                                                                                                                                                                                                                                                                                                                    In the end, unlock wakes up all the threads waiting for the BlockInfoManager.
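The two branches (write lock vs. read lock) can be summarized with the following sketch (simplified and assumed, mirroring the description above; the registries and logTrace are as mentioned in the text):

def unlock(blockId: BlockId, taskAttemptId: Option[TaskAttemptId] = None): Unit = synchronized {
  val taskId = taskAttemptId.getOrElse(currentTaskAttemptId)
  logTrace(s"Task $taskId releasing lock for $blockId")
  val info = infos.getOrElse(blockId,
    throw new IllegalStateException(s"Block $blockId not found"))
  if (info.writerTask != BlockInfo.NO_WRITER) {
    info.writerTask = BlockInfo.NO_WRITER               // release the write lock
    writeLocksByTask.removeBinding(taskId, blockId)
  } else {
    info.readerCount -= 1                               // release one read lock
    readLocksByTask(taskId).remove(blockId)             // the real registry counts multiple read locks per task
  }
  notifyAll()                                           // wake up threads waiting for this BlockInfoManager
}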

                                                                                                                                                                                                                                                                                                                                                    unlock is used when:

                                                                                                                                                                                                                                                                                                                                                    • BlockInfoManager is requested to downgradeLock
                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to releaseLock and doPut
                                                                                                                                                                                                                                                                                                                                                    • BlockManagerManagedBuffer is requested to release
                                                                                                                                                                                                                                                                                                                                                    • MemoryStore is requested to evictBlocksToFreeSpace
                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockInfoManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.storage.BlockInfoManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.storage.BlockInfoManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/","title":"BlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                    BlockManager manages the storage for blocks (chunks of data) that can be stored in memory and on disk.

                                                                                                                                                                                                                                                                                                                                                    BlockManager runs as part of the driver and executor processes.

BlockManager provides an interface for uploading and fetching blocks both locally and remotely using various stores (i.e. memory, disk, and off-heap).

Cached blocks are blocks with a non-zero sum of memory and disk sizes.

                                                                                                                                                                                                                                                                                                                                                    Tip

                                                                                                                                                                                                                                                                                                                                                    Use Web UI (esp. Storage and Executors tabs) to monitor the memory used.

                                                                                                                                                                                                                                                                                                                                                    Tip

Use spark-submit's command-line options (i.e. --driver-memory for the driver and --executor-memory for executors) or their equivalents as Spark properties (i.e. spark.driver.memory and spark.executor.memory) to control the memory available for storage.
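For example, the equivalent properties could be set programmatically when building the SparkConf (the values below are purely illustrative):

import org.apache.spark.SparkConf

// Illustrative sizes only; tune them for your own workload.
val conf = new SparkConf()
  .setAppName("storage-memory-demo")     // hypothetical application name
  .set("spark.driver.memory", "2g")
  .set("spark.executor.memory", "4g")

Note that in client deploy mode spark.driver.memory takes effect only if set before the driver JVM starts (e.g. via spark-defaults.conf or the --driver-memory option), so the command-line options are usually the safer choice.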

                                                                                                                                                                                                                                                                                                                                                    When External Shuffle Service is enabled, BlockManager uses ExternalShuffleClient to read shuffle files (of other executors).

                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                    BlockManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                    • Executor ID
                                                                                                                                                                                                                                                                                                                                                    • RpcEnv
                                                                                                                                                                                                                                                                                                                                                    • BlockManagerMaster
                                                                                                                                                                                                                                                                                                                                                    • SerializerManager
                                                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                                                    • MemoryManager
                                                                                                                                                                                                                                                                                                                                                    • MapOutputTracker
                                                                                                                                                                                                                                                                                                                                                    • ShuffleManager
                                                                                                                                                                                                                                                                                                                                                    • BlockTransferService
                                                                                                                                                                                                                                                                                                                                                    • SecurityManager
                                                                                                                                                                                                                                                                                                                                                    • Optional ExternalBlockStoreClient

                                                                                                                                                                                                                                                                                                                                                      When created, BlockManager sets externalShuffleServiceEnabled internal flag based on spark.shuffle.service.enabled configuration property.

                                                                                                                                                                                                                                                                                                                                                      BlockManager then creates an instance of DiskBlockManager (requesting deleteFilesOnStop when an external shuffle service is not in use).

BlockManager creates a block-manager-future daemon cached thread pool with a maximum of 128 threads (as futureExecutionContext).

                                                                                                                                                                                                                                                                                                                                                      BlockManager calculates the maximum memory to use (as maxMemory) by requesting the maximum on-heap and off-heap storage memory from the assigned MemoryManager.

                                                                                                                                                                                                                                                                                                                                                      BlockManager calculates the port used by the external shuffle service (as externalShuffleServicePort).

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a client to read other executors' shuffle files (as shuffleClient). If the external shuffle service is used...FIXME

                                                                                                                                                                                                                                                                                                                                                      BlockManager sets the maximum number of failures before this block manager refreshes the block locations from the driver (as maxFailuresBeforeLocationRefresh).

                                                                                                                                                                                                                                                                                                                                                      BlockManager registers a BlockManagerSlaveEndpoint with the input RpcEnv, itself, and MapOutputTracker (as slaveEndpoint).

                                                                                                                                                                                                                                                                                                                                                      BlockManager is created when SparkEnv is created (for the driver and executors) when a Spark application starts.

                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/BlockManager/#memorymanager","title":"MemoryManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a MemoryManager when created.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the MemoryManager for the following:

                                                                                                                                                                                                                                                                                                                                                      • Create a MemoryStore (that is then assigned to MemoryManager as a \"circular dependency\")

                                                                                                                                                                                                                                                                                                                                                      • Initialize maxOnHeapMemory and maxOffHeapMemory (for reporting)

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#diskblockmanager","title":"DiskBlockManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a DiskBlockManager when created.

BlockManager uses the DiskBlockManager for the following:

                                                                                                                                                                                                                                                                                                                                                      • Creating a DiskStore
                                                                                                                                                                                                                                                                                                                                                      • Registering an executor with a local external shuffle service (when initialized on an executor with externalShuffleServiceEnabled)

                                                                                                                                                                                                                                                                                                                                                      The DiskBlockManager is available as diskBlockManager reference to other Spark systems.

                                                                                                                                                                                                                                                                                                                                                      import org.apache.spark.SparkEnv\nSparkEnv.get.blockManager.diskBlockManager\n
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#migratableresolver","title":"MigratableResolver
                                                                                                                                                                                                                                                                                                                                                      migratableResolver: MigratableResolver\n

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a reference to a MigratableResolver by requesting the ShuffleManager for the ShuffleBlockResolver (that is assumed a MigratableResolver).

                                                                                                                                                                                                                                                                                                                                                      Lazy Value

                                                                                                                                                                                                                                                                                                                                                      migratableResolver is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.
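For reference, a minimal illustration of that Scala lazy val behaviour (plain Scala, unrelated to Spark itself):

// The initializer runs at most once, on first access; later reads reuse the cached value.
lazy val expensive: Int = {
  println("initializing...")   // printed only once
  42
}

expensive   // prints "initializing..." and yields 42
expensive   // yields 42 without re-running the initializer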

                                                                                                                                                                                                                                                                                                                                                      private[storage]

migratableResolver is private[storage] so it is available to other classes in the org.apache.spark.storage package.

                                                                                                                                                                                                                                                                                                                                                      migratableResolver is used when:

                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                      • ShuffleMigrationRunnable is requested to run
                                                                                                                                                                                                                                                                                                                                                      • BlockManagerDecommissioner is requested to refreshOffloadingShuffleBlocks
                                                                                                                                                                                                                                                                                                                                                      • FallbackStorage is requested to copy
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#local-directories-for-block-storage","title":"Local Directories for Block Storage
                                                                                                                                                                                                                                                                                                                                                      getLocalDiskDirs: Array[String]\n

getLocalDiskDirs requests the DiskBlockManager for the local directories for block storage.

getLocalDiskDirs is part of the BlockDataManager abstraction.

getLocalDiskDirs is also used by BlockManager when requested for the following:

                                                                                                                                                                                                                                                                                                                                                      • Register with a local external shuffle service
                                                                                                                                                                                                                                                                                                                                                      • Initialize
                                                                                                                                                                                                                                                                                                                                                      • Re-register
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#initializing-blockmanager","title":"Initializing BlockManager
                                                                                                                                                                                                                                                                                                                                                      initialize(\n  appId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                      initialize requests the BlockTransferService to initialize.

                                                                                                                                                                                                                                                                                                                                                      initialize requests the ExternalBlockStoreClient to initialize (if given).

                                                                                                                                                                                                                                                                                                                                                      initialize determines the BlockReplicationPolicy based on spark.storage.replication.policy configuration property and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Using [priorityClass] for block replication policy\n

                                                                                                                                                                                                                                                                                                                                                      initialize creates a BlockManagerId and requests the BlockManagerMaster to registerBlockManager (with the BlockManagerId, the local directories of the DiskBlockManager, the maxOnHeapMemory, the maxOffHeapMemory and the slaveEndpoint).

                                                                                                                                                                                                                                                                                                                                                      initialize sets the internal BlockManagerId to be the response from the BlockManagerMaster (if available) or the BlockManagerId just created.

                                                                                                                                                                                                                                                                                                                                                      initialize initializes the External Shuffle Server's Address when enabled and prints out the following INFO message to the logs (with the externalShuffleServicePort):

                                                                                                                                                                                                                                                                                                                                                      external shuffle service port = [externalShuffleServicePort]\n

                                                                                                                                                                                                                                                                                                                                                      (only for executors and External Shuffle Service enabled) initialize registers with the External Shuffle Server.

                                                                                                                                                                                                                                                                                                                                                      initialize determines the hostLocalDirManager. With spark.shuffle.readHostLocalDisk configuration property enabled and spark.shuffle.useOldFetchProtocol disabled, initialize uses the ExternalBlockStoreClient to create a HostLocalDirManager (with spark.storage.localDiskByExecutors.cacheSize configuration property).

                                                                                                                                                                                                                                                                                                                                                      In the end, initialize prints out the following INFO message to the logs (with the blockManagerId):

                                                                                                                                                                                                                                                                                                                                                      Initialized BlockManager: [blockManagerId]\n
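Put together, the steps above amount to roughly the following flow. This is a heavily simplified sketch that only mirrors the prose (the hostLocalDirManager step is omitted); method and field names follow the description, while the exact signatures are assumptions.

def initialize(appId: String): Unit = {
  blockTransferService.init(this)                        // initialize the BlockTransferService
  externalBlockStoreClient.foreach(_.init(appId))        // and the ExternalBlockStoreClient, if given

  // pick the replication policy (spark.storage.replication.policy)
  logInfo(s"Using $priorityClass for block replication policy")

  // register with the driver-side BlockManagerMaster
  val id = BlockManagerId(executorId, blockTransferService.hostName, blockTransferService.port)
  val idFromMaster = master.registerBlockManager(
    id, getLocalDiskDirs, maxOnHeapMemory, maxOffHeapMemory, slaveEndpoint)
  blockManagerId = if (idFromMaster != null) idFromMaster else id

  // external shuffle service (executors only)
  if (externalShuffleServiceEnabled) {
    logInfo(s"external shuffle service port = $externalShuffleServicePort")
    if (!blockManagerId.isDriver) registerWithExternalShuffleServer()
  }

  logInfo(s"Initialized BlockManager: $blockManagerId")
}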

                                                                                                                                                                                                                                                                                                                                                      initialize is used when:

                                                                                                                                                                                                                                                                                                                                                      • SparkContext is created (on the driver)
                                                                                                                                                                                                                                                                                                                                                      • Executor is created (with isLocal flag disabled)
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#registering-executors-blockmanager-with-external-shuffle-server","title":"Registering Executor's BlockManager with External Shuffle Server
                                                                                                                                                                                                                                                                                                                                                      registerWithExternalShuffleServer(): Unit\n

                                                                                                                                                                                                                                                                                                                                                      registerWithExternalShuffleServer registers the BlockManager (for an executor) with External Shuffle Service.

                                                                                                                                                                                                                                                                                                                                                      registerWithExternalShuffleServer prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Registering executor with local external shuffle service.\n

                                                                                                                                                                                                                                                                                                                                                      registerWithExternalShuffleServer creates an ExecutorShuffleInfo (with the localDirs and subDirsPerLocalDir of the DiskBlockManager, and the class name of the ShuffleManager).

registerWithExternalShuffleServer uses the spark.shuffle.registration.maxAttempts configuration property and a fixed 5-second sleep time when requesting the ExternalBlockStoreClient to registerWithShuffleServer (using the BlockManagerId and the ExecutorShuffleInfo).

If an exception occurs while the number of attempts is below the maximum, registerWithExternalShuffleServer prints out the following ERROR message to the logs and sleeps for 5 seconds:

                                                                                                                                                                                                                                                                                                                                                      Failed to connect to external shuffle server, will retry [attempts] more times after waiting 5 seconds...\n
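The retry behaviour can be sketched as follows (a simplified, assumed rendition of the logic described above; the exact registerWithShuffleServer signature may differ):

def registerWithExternalShuffleServer(): Unit = {
  logInfo("Registering executor with local external shuffle service.")
  val maxAttempts = conf.get("spark.shuffle.registration.maxAttempts").toInt
  val sleepTimeSecs = 5                                   // fixed 5-second wait between attempts
  for (attempt <- 1 to maxAttempts) {
    try {
      externalBlockStoreClient.get.registerWithShuffleServer(
        shuffleServerId.host, shuffleServerId.port, shuffleServerId.executorId, executorShuffleInfo)
      return                                              // registered successfully
    } catch {
      case e: Exception if attempt < maxAttempts =>
        logError(s"Failed to connect to external shuffle server, will retry " +
          s"${maxAttempts - attempt} more times after waiting $sleepTimeSecs seconds...", e)
        Thread.sleep(sleepTimeSecs * 1000)
    }
  }
}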
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanagerid","title":"BlockManagerId

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses a BlockManagerId for...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#hostlocaldirmanager","title":"HostLocalDirManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager can use a HostLocalDirManager.

                                                                                                                                                                                                                                                                                                                                                      Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockreplicationpolicy","title":"BlockReplicationPolicy

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses a BlockReplicationPolicy for...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#external-shuffle-services-port","title":"External Shuffle Service's Port

                                                                                                                                                                                                                                                                                                                                                      BlockManager determines the port of an external shuffle service when created.

                                                                                                                                                                                                                                                                                                                                                      The port is used to create the shuffleServerId and a HostLocalDirManager.

                                                                                                                                                                                                                                                                                                                                                      The port is also used for preferExecutors.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#sparkdiskstoresubdirectories-configuration-property","title":"spark.diskStore.subDirectories Configuration Property

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses spark.diskStore.subDirectories configuration property to initialize a subDirsPerLocalDir local value.
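As a rough illustration of what subDirsPerLocalDir is for, a block file name can be spread across local directories and their sub-directories with a hashing scheme along these lines (the exact formula below is an assumption for illustration, not necessarily Spark's):

// Map a block file name to one of the localDirs and one of its sub-directories.
def subDirFor(filename: String, localDirs: Array[String], subDirsPerLocalDir: Int): String = {
  val hash = filename.hashCode & Integer.MAX_VALUE                 // non-negative hash
  val dirId = hash % localDirs.length                              // pick a local directory
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir    // then a sub-directory in it
  s"${localDirs(dirId)}/${"%02x".format(subDirId)}"                // sub-dirs are typically hex-named
}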

                                                                                                                                                                                                                                                                                                                                                      subDirsPerLocalDir is used when:

                                                                                                                                                                                                                                                                                                                                                      • IndexShuffleBlockResolver is requested to getDataFile and getIndexFile
                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to readDiskBlockFromSameHostExecutor
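
To illustrate, a self-contained sketch of the hash-based placement of a block file across the local directories and their sub-directories (64 by default); placeFile is a hypothetical helper modelled after, not copied from, Spark's DiskBlockManager:

// Sketch: pick a local directory and a sub-directory for a block file based on its name's hash\ndef placeFile(filename: String, localDirs: Array[String], subDirsPerLocalDir: Int): String = {\n  val hash = filename.hashCode & Integer.MAX_VALUE               // non-negative hash\n  val dirId = hash % localDirs.length                            // which local directory\n  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // which sub-directory within it\n  f\"${localDirs(dirId)}%s/$subDirId%02x/$filename%s\"\n}\n\nplaceFile(\"shuffle_0_0_0.data\", Array(\"/tmp/blockmgr-1\"), subDirsPerLocalDir = 64)\n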
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#fetching-block-or-computing-and-storing-it","title":"Fetching Block or Computing (and Storing) it
                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate[T](\n  blockId: BlockId,\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]]\n

                                                                                                                                                                                                                                                                                                                                                      Map.getOrElseUpdate

                                                                                                                                                                                                                                                                                                                                                      I think it is fair to say that getOrElseUpdate is like getOrElseUpdate of scala.collection.mutable.Map in Scala.

                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate(key: K, op: \u21d2 V): V\n

                                                                                                                                                                                                                                                                                                                                                      Quoting the official scaladoc:

                                                                                                                                                                                                                                                                                                                                                      If given key K is already in this map, getOrElseUpdate returns the associated value V.

                                                                                                                                                                                                                                                                                                                                                      Otherwise, getOrElseUpdate computes a value V from given expression op, stores with the key K in the map and returns that value.

Since BlockManager is a key-value store of blocks of data identified by block IDs, the analogy fits well.
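
For a quick reminder of that contract in plain Scala:

import scala.collection.mutable\n\nval cache = mutable.Map.empty[String, Int]\n\ncache.getOrElseUpdate(\"answer\", { println(\"computing\"); 42 })  // prints \"computing\" and returns 42\ncache.getOrElseUpdate(\"answer\", { println(\"computing\"); 0 })   // returns the cached 42, prints nothing\n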

                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate first attempts to get the block by the BlockId (from the local block manager first and, if unavailable, requesting remote peers).

                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate gives the BlockResult of the block if found.

If however the block was not found (in any block manager in a Spark cluster), getOrElseUpdate calls doPutIterator (with the input BlockId, the makeIterator function and the StorageLevel).

                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate branches off per the result:

• For None (the block was computed and stored successfully), getOrElseUpdate retrieves the block with getLocalValues for the BlockId and eventually returns the BlockResult (unless terminated by a SparkException due to some internal error)
• For Some(iter) (the block could not be stored), getOrElseUpdate returns the iterator of T values (as sketched below)

                                                                                                                                                                                                                                                                                                                                                      getOrElseUpdate is used when:

                                                                                                                                                                                                                                                                                                                                                      • RDD is requested to get or compute an RDD partition (for an RDDBlockId with the RDD's id and partition index).
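
To make the branching concrete, here is a toy, runnable analogue with a mutable Map standing in for the block store (GetOrElseUpdateDemo is made up for illustration; it is not the actual implementation):

import scala.collection.mutable\n\n// Toy analogue: look the block up first, otherwise compute it, store it if possible,\n// or hand the iterator back to the caller (mirroring doPutIterator returning Some(iter)).\nobject GetOrElseUpdateDemo extends App {\n  val store = mutable.Map.empty[String, Vector[Int]]\n  val storable = true  // stands in for \"the StorageLevel allows storing the block\"\n\n  def getOrElseUpdate(id: String, makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =\n    store.get(id) match {\n      case Some(values) => Left(values)        // \"Found block locally/remotely\"\n      case None if storable =>\n        val values = makeIterator().toVector   // compute...\n        store(id) = values                     // ...and store\n        Left(values)                           // ...then return the stored values\n      case None =>\n        Right(makeIterator())                  // could not be stored: return the iterator\n    }\n\n  println(getOrElseUpdate(\"rdd_0_0\", () => Iterator(1, 2, 3)))  // computes and stores\n  println(getOrElseUpdate(\"rdd_0_0\", () => Iterator.empty))     // served from the store\n}\n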
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#fetching-block","title":"Fetching Block
                                                                                                                                                                                                                                                                                                                                                      get[T: ClassTag](\n  blockId: BlockId): Option[BlockResult]\n

                                                                                                                                                                                                                                                                                                                                                      get attempts to fetch the block (BlockId) from a local block manager first before requesting it from remote block managers. get returns a BlockResult or None (to denote \"a block is not available\").

                                                                                                                                                                                                                                                                                                                                                      Internally, get tries to fetch the block from the local BlockManager. If found, get prints out the following INFO message to the logs and returns a BlockResult.

                                                                                                                                                                                                                                                                                                                                                      Found block [blockId] locally\n

                                                                                                                                                                                                                                                                                                                                                      If however the block was not found locally, get tries to fetch the block from remote BlockManagers. If fetched, get prints out the following INFO message to the logs and returns a BlockResult.

                                                                                                                                                                                                                                                                                                                                                      Found block [blockId] remotely\n
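
The local-then-remote lookup can be sketched as follows (an assumed shape based on the description above, not the actual code):

// Assumed shape of the lookup: local first, then remote, None otherwise\ndef get[T: ClassTag](blockId: BlockId): Option[BlockResult] =\n  getLocalValues(blockId) match {\n    case Some(local) =>\n      logInfo(s\"Found block $blockId locally\")\n      Some(local)\n    case None =>\n      getRemoteValues[T](blockId).map { remote =>\n        logInfo(s\"Found block $blockId remotely\")\n        remote\n      }\n  }\n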
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#getremotevalues","title":"getRemoteValues
                                                                                                                                                                                                                                                                                                                                                      getRemoteValues[T: ClassTag](\n  blockId: BlockId): Option[BlockResult]\n

getRemoteValues calls getRemoteBlock with a bufferTransformer function that takes a ManagedBuffer and does the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                      • Requests the SerializerManager to deserialize values from an input stream from the ManagedBuffer
                                                                                                                                                                                                                                                                                                                                                      • Creates a BlockResult with the values (and their total size, and Network read method)
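
A sketch of that transformer (the names follow the description above; the exact signatures are assumptions):

// Assumed shape of the bufferTransformer passed to getRemoteBlock\nval bufferTransformer = (buffer: ManagedBuffer) => {\n  // deserialize the values from the buffer's input stream...\n  val values = serializerManager.dataDeserializeStream(blockId, buffer.createInputStream())(classTag)\n  // ...and wrap them up with the Network read method and the number of bytes read\n  new BlockResult(values, DataReadMethod.Network, buffer.size)\n}\n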
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#fetching-block-bytes-from-remote-block-managers","title":"Fetching Block Bytes From Remote Block Managers
                                                                                                                                                                                                                                                                                                                                                      getRemoteBytes(\n  blockId: BlockId): Option[ChunkedByteBuffer]\n

getRemoteBytes calls getRemoteBlock with a bufferTransformer function that takes a ManagedBuffer and creates a ChunkedByteBuffer.

                                                                                                                                                                                                                                                                                                                                                      getRemoteBytes is used when:

                                                                                                                                                                                                                                                                                                                                                      • TorrentBroadcast is requested to readBlocks
                                                                                                                                                                                                                                                                                                                                                      • TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#fetching-remote-block","title":"Fetching Remote Block
                                                                                                                                                                                                                                                                                                                                                      getRemoteBlock[T](\n  blockId: BlockId,\n  bufferTransformer: ManagedBuffer => T): Option[T]\n

getRemoteBlock is used for getRemoteValues and getRemoteBytes.

                                                                                                                                                                                                                                                                                                                                                      getRemoteBlock prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Getting remote block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                      getRemoteBlock requests the BlockManagerMaster for locations and status of the input BlockId (with the host of BlockManagerId).

                                                                                                                                                                                                                                                                                                                                                      With some locations, getRemoteBlock determines the size of the block (max of diskSize and memSize). getRemoteBlock tries to read the block from the local directories of another executor on the same host. getRemoteBlock prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Read [blockId] from the disk of a same host executor is [successful|failed].\n

When the data block could not be found in any of the local directories, getRemoteBlock falls back to fetchRemoteManagedBuffer.
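
Putting the steps together (an assumed shape; the method names come from the prose above and the exact signatures are assumptions):

// Assumed shape of the flow: ask the BlockManagerMaster for locations and status,\n// prefer the disk of a same-host executor, then fall back to a network fetch.\ndef getRemoteBlock[T](blockId: BlockId, bufferTransformer: ManagedBuffer => T): Option[T] = {\n  logDebug(s\"Getting remote block $blockId\")\n  master.getLocationsAndStatus(blockId, blockManagerId.host).flatMap { info =>\n    val blockSize = math.max(info.status.diskSize, info.status.memSize)\n    val buffer = info.localDirs\n      .flatMap(dirs => readDiskBlockFromSameHostExecutor(blockId, dirs, blockSize))\n      .orElse(fetchRemoteManagedBuffer(blockId, blockSize, info))\n    buffer.map(bufferTransformer)\n  }\n}\n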

                                                                                                                                                                                                                                                                                                                                                      For no locations from the BlockManagerMaster, getRemoteBlock prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#readdiskblockfromsamehostexecutor","title":"readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                      readDiskBlockFromSameHostExecutor(\n  blockId: BlockId,\n  localDirs: Array[String],\n  blockSize: Long): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                      readDiskBlockFromSameHostExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#fetchremotemanagedbuffer","title":"fetchRemoteManagedBuffer
                                                                                                                                                                                                                                                                                                                                                      fetchRemoteManagedBuffer(\n  blockId: BlockId,\n  blockSize: Long,\n  locationsAndStatus: BlockManagerMessages.BlockLocationsAndStatus): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                      fetchRemoteManagedBuffer...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#sortlocations","title":"sortLocations
                                                                                                                                                                                                                                                                                                                                                      sortLocations(\n  locations: Seq[BlockManagerId]): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                      sortLocations...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#preferexecutors","title":"preferExecutors
                                                                                                                                                                                                                                                                                                                                                      preferExecutors(\n  locations: Seq[BlockManagerId]): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                      preferExecutors...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#readdiskblockfromsamehostexecutor_1","title":"readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                      readDiskBlockFromSameHostExecutor(\n  blockId: BlockId,\n  localDirs: Array[String],\n  blockSize: Long): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                      readDiskBlockFromSameHostExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#executioncontextexecutorservice","title":"ExecutionContextExecutorService

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses a Scala ExecutionContextExecutorService to execute FIXME asynchronously (on a thread pool with block-manager-future prefix and maximum of 128 threads).

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockevictionhandler","title":"BlockEvictionHandler

                                                                                                                                                                                                                                                                                                                                                      BlockManager is a BlockEvictionHandler that can drop a block from memory (and store it on a disk when necessary).

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#shuffleclient-and-external-shuffle-service","title":"ShuffleClient and External Shuffle Service

                                                                                                                                                                                                                                                                                                                                                      Danger

                                                                                                                                                                                                                                                                                                                                                      FIXME ShuffleClient and ExternalShuffleClient are dead. Long live BlockStoreClient and ExternalBlockStoreClient.

                                                                                                                                                                                                                                                                                                                                                      BlockManager manages the lifecycle of a ShuffleClient:

• Creates it when created

• Initializes it (and possibly registers it with an external shuffle server) when requested to initialize

• Closes it when requested to stop

                                                                                                                                                                                                                                                                                                                                                      The ShuffleClient can be an ExternalShuffleClient or the given BlockTransferService based on spark.shuffle.service.enabled configuration property. When enabled, BlockManager uses the ExternalShuffleClient.
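
The choice boils down to the spark.shuffle.service.enabled configuration property (false by default), e.g.:

import org.apache.spark.SparkConf\n\n// When the external shuffle service is enabled, BlockManager talks to it through an\n// ExternalShuffleClient; otherwise it uses the given BlockTransferService directly.\nval conf = new SparkConf()\nval externalShuffleServiceEnabled = conf.getBoolean(\"spark.shuffle.service.enabled\", false)\nval shuffleClientKind = if (externalShuffleServiceEnabled) \"ExternalShuffleClient\" else \"BlockTransferService\"\n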

                                                                                                                                                                                                                                                                                                                                                      The ShuffleClient is available to other Spark services (using shuffleClient value) and is used when BlockStoreShuffleReader is requested to read combined key-value records for a reduce task.

                                                                                                                                                                                                                                                                                                                                                      When requested for shuffle metrics, BlockManager simply requests them from the ShuffleClient.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanager-and-rpcenv","title":"BlockManager and RpcEnv

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a RpcEnv when created.

                                                                                                                                                                                                                                                                                                                                                      The RpcEnv is used to set up a BlockManagerSlaveEndpoint.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockinfomanager","title":"BlockInfoManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a BlockInfoManager when created.

                                                                                                                                                                                                                                                                                                                                                      BlockManager requests the BlockInfoManager to clear when requested to stop.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the BlockInfoManager to create a MemoryStore.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the BlockInfoManager when requested for the following:

                                                                                                                                                                                                                                                                                                                                                      • reportAllBlocks

                                                                                                                                                                                                                                                                                                                                                      • getStatus

                                                                                                                                                                                                                                                                                                                                                      • getMatchingBlockIds

                                                                                                                                                                                                                                                                                                                                                      • getLocalValues and getLocalBytes

                                                                                                                                                                                                                                                                                                                                                      • doPut

                                                                                                                                                                                                                                                                                                                                                      • replicateBlock

                                                                                                                                                                                                                                                                                                                                                      • dropFromMemory

                                                                                                                                                                                                                                                                                                                                                      • removeRdd, removeBroadcast, removeBlock, removeBlockInternal

                                                                                                                                                                                                                                                                                                                                                      • downgradeLock, releaseLock, registerTask, releaseAllLocksForTask

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanager-and-blockmanagermaster","title":"BlockManager and BlockManagerMaster

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a BlockManagerMaster when created.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanager-as-blockdatamanager","title":"BlockManager as BlockDataManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager is a BlockDataManager.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanager-and-mapoutputtracker","title":"BlockManager and MapOutputTracker

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a MapOutputTracker when created.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#executor-id","title":"Executor ID

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given an Executor ID when created.

                                                                                                                                                                                                                                                                                                                                                      The Executor ID is one of the following:

                                                                                                                                                                                                                                                                                                                                                      • driver (SparkContext.DRIVER_IDENTIFIER) for the driver

                                                                                                                                                                                                                                                                                                                                                      • Value of --executor-id command-line argument for CoarseGrainedExecutorBackend executors

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockmanagerendpoint-rpc-endpoint","title":"BlockManagerEndpoint RPC Endpoint

                                                                                                                                                                                                                                                                                                                                                      BlockManager requests the RpcEnv to register a BlockManagerSlaveEndpoint under the name BlockManagerEndpoint[ID].

                                                                                                                                                                                                                                                                                                                                                      The RPC endpoint is used when BlockManager is requested to initialize and reregister (to register the BlockManager on an executor with the BlockManagerMaster on the driver).

                                                                                                                                                                                                                                                                                                                                                      The endpoint is stopped (by requesting the RpcEnv to stop the reference) when BlockManager is requested to stop.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#accessing-blockmanager","title":"Accessing BlockManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager is available using SparkEnv on the driver and executors.

                                                                                                                                                                                                                                                                                                                                                      import org.apache.spark.SparkEnv\nval bm = SparkEnv.get.blockManager\n\nscala> :type bm\norg.apache.spark.storage.BlockManager\n
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockstoreclient","title":"BlockStoreClient

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses a BlockStoreClient to read other executors' blocks. This is an ExternalBlockStoreClient (when given and an external shuffle service is used) or a BlockTransferService (to directly connect to other executors).

                                                                                                                                                                                                                                                                                                                                                      This BlockStoreClient is used when:

• BlockStoreShuffleReader is requested to read combined key-values for a reduce task
• BlockManager is requested to create the HostLocalDirManager (at initialization)
• BlockManager is requested for the shuffleMetricsSource
• BlockManager is requested to registerWithExternalShuffleServer (when an external shuffle server is used and the ExternalBlockStoreClient is defined)
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blocktransferservice","title":"BlockTransferService

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a BlockTransferService when created.

                                                                                                                                                                                                                                                                                                                                                      Note

There is only one concrete BlockTransferService, namely NettyBlockTransferService, and there seems to be no way to reconfigure Apache Spark to use a different implementation (if there were any).

                                                                                                                                                                                                                                                                                                                                                      BlockTransferService is used when BlockManager is requested to fetch a block from and replicate a block to remote block managers.

                                                                                                                                                                                                                                                                                                                                                      BlockTransferService is used as the BlockStoreClient (unless an ExternalBlockStoreClient is specified).

                                                                                                                                                                                                                                                                                                                                                      BlockTransferService is initialized with this BlockManager.

                                                                                                                                                                                                                                                                                                                                                      BlockTransferService is closed when BlockManager is requested to stop.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#shufflemanager","title":"ShuffleManager

                                                                                                                                                                                                                                                                                                                                                      BlockManager is given a ShuffleManager when created.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the ShuffleManager for the following:

• Retrieving block data (for shuffle blocks)

• Retrieving non-shuffle block data (for shuffle blocks anyway)

                                                                                                                                                                                                                                                                                                                                                      • Registering an executor with a local external shuffle service (when initialized on an executor with externalShuffleServiceEnabled)

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#memorystore","title":"MemoryStore

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a MemoryStore when created (with the BlockInfoManager, the SerializerManager, the MemoryManager and itself as a BlockEvictionHandler).

                                                                                                                                                                                                                                                                                                                                                      BlockManager requests the MemoryManager to use the MemoryStore.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the MemoryStore for the following:

                                                                                                                                                                                                                                                                                                                                                      • getStatus and getCurrentBlockStatus

                                                                                                                                                                                                                                                                                                                                                      • getLocalValues

                                                                                                                                                                                                                                                                                                                                                      • doGetLocalBytes

                                                                                                                                                                                                                                                                                                                                                      • doPutBytes and doPutIterator

                                                                                                                                                                                                                                                                                                                                                      • maybeCacheDiskBytesInMemory and maybeCacheDiskValuesInMemory

                                                                                                                                                                                                                                                                                                                                                      • dropFromMemory

                                                                                                                                                                                                                                                                                                                                                      • removeBlockInternal

                                                                                                                                                                                                                                                                                                                                                      The MemoryStore is requested to clear when BlockManager is requested to stop.

                                                                                                                                                                                                                                                                                                                                                      The MemoryStore is available as memoryStore private reference to other Spark services.

```scala
import org.apache.spark.SparkEnv
SparkEnv.get.blockManager.memoryStore
```

                                                                                                                                                                                                                                                                                                                                                      The MemoryStore is used (via SparkEnv.get.blockManager.memoryStore reference) when Task is requested to run (that has just finished execution and requests the MemoryStore to release unroll memory).

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#diskstore","title":"DiskStore

                                                                                                                                                                                                                                                                                                                                                      BlockManager creates a DiskStore (with the DiskBlockManager) when created.

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses the DiskStore when requested for the following:

                                                                                                                                                                                                                                                                                                                                                      • getStatus
                                                                                                                                                                                                                                                                                                                                                      • getCurrentBlockStatus
                                                                                                                                                                                                                                                                                                                                                      • getLocalValues
                                                                                                                                                                                                                                                                                                                                                      • doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                      • doPutIterator
                                                                                                                                                                                                                                                                                                                                                      • dropFromMemory
                                                                                                                                                                                                                                                                                                                                                      • removeBlockInternal

                                                                                                                                                                                                                                                                                                                                                      DiskStore is used when:

                                                                                                                                                                                                                                                                                                                                                      • ByteBufferBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                      • TempFileBasedBlockStoreUpdater is requested to blockData and saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#performance-metrics","title":"Performance Metrics

                                                                                                                                                                                                                                                                                                                                                      BlockManager uses BlockManagerSource to report metrics under the name BlockManager.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#getpeers","title":"getPeers
                                                                                                                                                                                                                                                                                                                                                      getPeers(\n  forceFetch: Boolean): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                      getPeers...FIXME

                                                                                                                                                                                                                                                                                                                                                      getPeers is used when BlockManager is requested to replicateBlock and replicate.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#releasing-all-locks-for-task","title":"Releasing All Locks For Task
                                                                                                                                                                                                                                                                                                                                                      releaseAllLocksForTask(\n  taskAttemptId: Long): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                      releaseAllLocksForTask...FIXME

                                                                                                                                                                                                                                                                                                                                                      releaseAllLocksForTask is used when TaskRunner is requested to run (at the end of a task).

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#stopping-blockmanager","title":"Stopping BlockManager
                                                                                                                                                                                                                                                                                                                                                      stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                      stop...FIXME

                                                                                                                                                                                                                                                                                                                                                      stop is used when SparkEnv is requested to stop.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#getting-ids-of-existing-blocks-for-a-given-filter","title":"Getting IDs of Existing Blocks (For a Given Filter)
                                                                                                                                                                                                                                                                                                                                                      getMatchingBlockIds(\n  filter: BlockId => Boolean): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                      getMatchingBlockIds...FIXME

                                                                                                                                                                                                                                                                                                                                                      getMatchingBlockIds is used when BlockManagerSlaveEndpoint is requested to handle a GetMatchingBlockIds message.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#getting-local-block","title":"Getting Local Block
                                                                                                                                                                                                                                                                                                                                                      getLocalValues(\n  blockId: BlockId): Option[BlockResult]\n

                                                                                                                                                                                                                                                                                                                                                      getLocalValues prints out the following DEBUG message to the logs:

```text
Getting local block [blockId]
```

                                                                                                                                                                                                                                                                                                                                                      getLocalValues obtains a read lock for blockId.

When no blockId block was found, you should see the following DEBUG message in the logs and getLocalValues returns None.

```text
Block [blockId] was not found
```

                                                                                                                                                                                                                                                                                                                                                      When the blockId block was found, you should see the following DEBUG message in the logs:

```text
Level for block [blockId] is [level]
```

If the blockId block uses a memory storage level and is registered in the MemoryStore, getLocalValues returns a BlockResult (with the Memory read method) whose values are a CompletionIterator over one of the following iterators:

1. The values iterator from the MemoryStore for the blockId block (for "deserialized" persistence levels)
2. An iterator from the SerializerManager after deserializing the data stream of the blockId block's bytes from the MemoryStore (for "serialized" persistence levels)
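
The split between the two read paths can be summarized with a small, self-contained sketch (not Spark's actual code; `valuesInMemory`, `bytesInMemory` and `deserialize` are hypothetical stand-ins for the MemoryStore and SerializerManager calls):

```scala
// A simplified sketch of the two memory read paths described above:
// deserialized levels iterate the stored values directly, while serialized
// levels deserialize the stored bytes first.
def memoryReadSketch[T](
    deserialized: Boolean,
    valuesInMemory: () => Iterator[T],       // stand-in for MemoryStore values
    bytesInMemory: () => Array[Byte],        // stand-in for MemoryStore bytes
    deserialize: Array[Byte] => Iterator[T]  // stand-in for SerializerManager
): Iterator[T] =
  if (deserialized) valuesInMemory()
  else deserialize(bytesInMemory())
```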

                                                                                                                                                                                                                                                                                                                                                      getLocalValues is used when:

                                                                                                                                                                                                                                                                                                                                                      • TorrentBroadcast is requested to readBroadcastBlock

                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to get and getOrElseUpdate

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#maybecachediskvaluesinmemory","title":"maybeCacheDiskValuesInMemory
                                                                                                                                                                                                                                                                                                                                                      maybeCacheDiskValuesInMemory[T](\n  blockInfo: BlockInfo,\n  blockId: BlockId,\n  level: StorageLevel,\n  diskIterator: Iterator[T]): Iterator[T]\n

                                                                                                                                                                                                                                                                                                                                                      maybeCacheDiskValuesInMemory...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#retrieving-block-data","title":"Retrieving Block Data
                                                                                                                                                                                                                                                                                                                                                      getBlockData(\n  blockId: BlockId): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                                                                      getBlockData is part of the BlockDataManager abstraction.

For a BlockId of a shuffle (a ShuffleBlockId), getBlockData requests the ShuffleManager for the ShuffleBlockResolver that is then requested for the block data (getBlockData).

Otherwise, getBlockData retrieves the local (non-shuffle) block data (getLocalBytes) for the given BlockId.

If found, getBlockData creates a new BlockManagerManagedBuffer (with the BlockInfoManager, the input BlockId, the retrieved BlockData and the dispose flag enabled).

If not found, getBlockData reports to the BlockManagerMaster that the block could not be found (and that the master should no longer assume the block is available on this executor) and throws a BlockNotFoundException.

                                                                                                                                                                                                                                                                                                                                                      NOTE: getBlockData is executed for shuffle blocks or local blocks that the BlockManagerMaster knows this executor really has (unless BlockManagerMaster is outdated).
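
A minimal sketch of the control flow described above (not Spark's implementation; `resolveShuffleBlock` and `resolveLocalBlock` are hypothetical stand-ins for the ShuffleBlockResolver and the local-bytes lookup):

```scala
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.storage.BlockId

// Shuffle blocks are delegated to the shuffle layer; everything else is read
// from local storage or reported as missing.
def getBlockDataSketch(
    blockId: BlockId,
    resolveShuffleBlock: BlockId => ManagedBuffer,
    resolveLocalBlock: BlockId => Option[ManagedBuffer]): ManagedBuffer =
  if (blockId.isShuffle) {
    resolveShuffleBlock(blockId)
  } else {
    resolveLocalBlock(blockId).getOrElse {
      // the real method also tells the master the block is gone and
      // throws a BlockNotFoundException
      throw new NoSuchElementException(s"Block $blockId not found")
    }
  }
```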

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#retrieving-non-shuffle-local-block-data","title":"Retrieving Non-Shuffle Local Block Data
                                                                                                                                                                                                                                                                                                                                                      getLocalBytes(\n  blockId: BlockId): Option[BlockData]\n

                                                                                                                                                                                                                                                                                                                                                      getLocalBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                      getLocalBytes is used when:

                                                                                                                                                                                                                                                                                                                                                      • TorrentBroadcast is requested to readBlocks
                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested for the block data (of a non-shuffle block)
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#storing-block-data-locally","title":"Storing Block Data Locally
                                                                                                                                                                                                                                                                                                                                                      putBlockData(\n  blockId: BlockId,\n  data: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Boolean\n

                                                                                                                                                                                                                                                                                                                                                      putBlockData is part of the BlockDataManager abstraction.

putBlockData stores the block locally (putBytes) using the Java NIO ByteBuffer of the given ManagedBuffer.
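
As a minimal illustration of that conversion (the nioByteBuffer call is the standard ManagedBuffer accessor; wrapping it in a helper is only for the example):

```scala
import java.nio.ByteBuffer
import org.apache.spark.network.buffer.ManagedBuffer

// Expose the incoming block data as a Java NIO ByteBuffer, the form the
// byte-oriented put path works with.
def toNioBuffer(data: ManagedBuffer): ByteBuffer = data.nioByteBuffer()
```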

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#storing-block-bytebuffer-locally","title":"Storing Block (ByteBuffer) Locally
                                                                                                                                                                                                                                                                                                                                                      putBytes(\n  blockId: BlockId,\n  bytes: ChunkedByteBuffer,\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                      putBytes creates a ByteBufferBlockStoreUpdater that is then requested to store the bytes.

                                                                                                                                                                                                                                                                                                                                                      putBytes is used when:

• BlockManager is requested to put block data locally (putBlockData)
                                                                                                                                                                                                                                                                                                                                                      • TaskRunner is requested to run (and the result size is above maxDirectResultSize)
                                                                                                                                                                                                                                                                                                                                                      • TorrentBroadcast is requested to writeBlocks and readBlocks
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#doputbytes","title":"doPutBytes
                                                                                                                                                                                                                                                                                                                                                      doPutBytes[T](\n  blockId: BlockId,\n  bytes: ChunkedByteBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  tellMaster: Boolean = true,\n  keepReadLock: Boolean = false): Boolean\n

doPutBytes calls the internal helper doPut with a function that accepts a BlockInfo and does the uploading.

Inside the function, if the storage level's replication is greater than 1, it immediately starts replication of the blockId block on a separate thread (from the futureExecutionContext thread pool). The replication uses the input bytes and level storage level.

For a memory storage level, the function checks whether the storage level is deserialized or not. For a deserialized storage level, BlockManager's SerializerManager deserializes the bytes into an iterator of values that the MemoryStore stores. If, however, the storage level is not deserialized, the function requests the MemoryStore to store the bytes.

                                                                                                                                                                                                                                                                                                                                                      If the put did not succeed and the storage level is to use disk, you should see the following WARN message in the logs:

```text
Persisting block [blockId] to disk instead.
```

And the DiskStore stores the bytes.

NOTE: The DiskStore is requested to store the bytes of a block with memory and disk storage level only when the MemoryStore has failed.

If the storage level is to use disk only, the DiskStore stores the bytes.

doPutBytes requests the current block status and, if the block was successfully stored and the driver should know about it (tellMaster), the function reports the block status to the driver. The current TaskContext metrics are updated with the updated block status (only when executed inside a task where TaskContext is available).

                                                                                                                                                                                                                                                                                                                                                      You should see the following DEBUG message in the logs:

```text
Put block [blockId] locally took [time] ms
```

                                                                                                                                                                                                                                                                                                                                                      The function waits till the earlier asynchronous replication finishes for a block with replication level greater than 1.

The final result of doPutBytes indicates whether the block was stored successfully or not (as computed earlier).
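
The memory-first, disk-fallback decision described above can be sketched as follows (a toy model, not Spark's code; `putInMemory` and `putOnDisk` are hypothetical stand-ins for the MemoryStore and DiskStore calls):

```scala
// Try the MemoryStore first for memory levels; fall back to the DiskStore only
// when the memory put fails and the storage level also allows disk.
def storeBytesSketch(
    useMemory: Boolean,
    useDisk: Boolean,
    putInMemory: () => Boolean,
    putOnDisk: () => Unit): Boolean =
  if (useMemory) {
    val storedInMemory = putInMemory()
    if (!storedInMemory && useDisk) {
      println("Persisting block to disk instead.")
      putOnDisk()
      true
    } else storedInMemory
  } else if (useDisk) {
    putOnDisk()
    true
  } else {
    false
  }
```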

NOTE: doPutBytes is used exclusively when BlockManager is requested to store a block (ByteBuffer) locally (putBytes).

## Putting New Block

```scala
doPut[T](
  blockId: BlockId,
  level: StorageLevel,
  classTag: ClassTag[_],
  tellMaster: Boolean,
  keepReadLock: Boolean)(putBody: BlockInfo => Option[T]): Option[T]
```

                                                                                                                                                                                                                                                                                                                                                      doPut requires that the given StorageLevel is valid.

                                                                                                                                                                                                                                                                                                                                                      doPut creates a new BlockInfo and requests the BlockInfoManager for a write lock for the block.

                                                                                                                                                                                                                                                                                                                                                      doPut executes the given putBody function (with the BlockInfo).

If the result of the putBody function is None, the block is considered saved successfully.

                                                                                                                                                                                                                                                                                                                                                      For successful save, doPut requests the BlockInfoManager to downgradeLock or unlock based on the given keepReadLock flag (true and false, respectively).

For an unsuccessful save (when putBody returns some value), doPut removes the block (removeBlockInternal) and prints out the following WARN message to the logs:

```text
Putting block [blockId] failed
```

                                                                                                                                                                                                                                                                                                                                                      In the end, doPut prints out the following DEBUG message to the logs:

```text
Putting block [blockId] [withOrWithout] replication took [usedTime] ms
```
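
The lock-and-put flow described above can be sketched like this (a simplified model; the lock helpers and removeBlockInternal are hypothetical stand-ins for the BlockInfoManager and BlockManager internals):

```scala
// Acquire a write lock, run the caller-supplied putBody, and either release
// (or downgrade) the lock on success or clean up on failure.
def doPutSketch[T](
    blockId: String,
    keepReadLock: Boolean,
    lockForWriting: String => Boolean,
    downgradeLock: String => Unit,
    unlock: String => Unit,
    removeBlockInternal: String => Unit)(putBody: => Option[T]): Option[T] = {
  require(lockForWriting(blockId), s"Could not acquire a write lock for $blockId")
  val result = putBody
  if (result.isEmpty) {
    // None signals a successful save
    if (keepReadLock) downgradeLock(blockId) else unlock(blockId)
  } else {
    removeBlockInternal(blockId)
    println(s"Putting block $blockId failed")
  }
  result
}
```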

                                                                                                                                                                                                                                                                                                                                                      doPut is used when:

                                                                                                                                                                                                                                                                                                                                                      • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to doPutIterator
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#removing-block","title":"Removing Block
                                                                                                                                                                                                                                                                                                                                                      removeBlock(\n  blockId: BlockId,\n  tellMaster: Boolean = true): Unit\n

                                                                                                                                                                                                                                                                                                                                                      removeBlock prints out the following DEBUG message to the logs:

```text
Removing block [blockId]
```

                                                                                                                                                                                                                                                                                                                                                      removeBlock requests the BlockInfoManager for write lock on the block.

With a write lock on the block, removeBlock removes the block (removeBlockInternal) with the tellMaster flag turned on when the input tellMaster flag and the tellMaster flag of the block itself are both turned on.

In the end, removeBlock updates the task metrics (addUpdatedBlockStatusToTaskMetrics) with an empty BlockStatus.

                                                                                                                                                                                                                                                                                                                                                      In case the block is no longer available (None), removeBlock prints out the following WARN message to the logs:

```text
Asked to remove block [blockId], which does not exist
```
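
Putting the steps above together, a toy sketch of the removal flow could look as follows (hypothetical stand-ins for the BlockInfoManager lock and the removal and metrics helpers):

```scala
// Lock the block for writing; remove it and report an empty status if it
// exists, otherwise only warn.
def removeBlockSketch(
    blockId: String,
    tellMaster: Boolean,
    lockForWriting: String => Option[Boolean],  // Some(blockTellMaster) if the block exists
    removeBlockInternal: (String, Boolean) => Unit,
    reportEmptyStatus: String => Unit): Unit = {
  println(s"Removing block $blockId")
  lockForWriting(blockId) match {
    case Some(blockTellMaster) =>
      removeBlockInternal(blockId, tellMaster && blockTellMaster)
      reportEmptyStatus(blockId)
    case None =>
      println(s"Asked to remove block $blockId, which does not exist")
  }
}
```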

                                                                                                                                                                                                                                                                                                                                                      removeBlock is used when:

                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to handleLocalReadFailure, removeRdd, removeBroadcast
                                                                                                                                                                                                                                                                                                                                                      • BlockManagerDecommissioner is requested to migrate a block
                                                                                                                                                                                                                                                                                                                                                      • BlockManagerStorageEndpoint is requested to handle a RemoveBlock message
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#removing-rdd-blocks","title":"Removing RDD Blocks
                                                                                                                                                                                                                                                                                                                                                      removeRdd(\n  rddId: Int): Int\n

                                                                                                                                                                                                                                                                                                                                                      removeRdd removes all the blocks that belong to the rddId RDD.

                                                                                                                                                                                                                                                                                                                                                      It prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Removing RDD [rddId]\n

It then requests the blocks of the rddId RDD from the BlockInfoManager and removes them (without informing the driver).

                                                                                                                                                                                                                                                                                                                                                      The number of blocks removed is the final result.

NOTE: removeRdd is used by BlockManagerSlaveEndpoint while handling RemoveRdd messages.
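For illustration only (this is user-facing code, not BlockManager internals): unpersisting a cached RDD is one way removeRdd ends up being executed on every BlockManager that holds the RDD's blocks. A minimal sketch, assuming a local Spark setup:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: cache an RDD, then unpersist it; the driver asks every
// BlockManager holding the RDD's blocks to remove them.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("remove-rdd-demo"))
val rdd = sc.parallelize(1 to 1000).cache()
rdd.count()                      // materializes and stores the RDD blocks
rdd.unpersist(blocking = true)   // requests removal of all of the RDD's blocks
sc.stop()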

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#removing-all-blocks-of-broadcast-variable","title":"Removing All Blocks of Broadcast Variable
                                                                                                                                                                                                                                                                                                                                                      removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int\n

                                                                                                                                                                                                                                                                                                                                                      removeBroadcast removes all the blocks of the input broadcastId broadcast.

                                                                                                                                                                                                                                                                                                                                                      Internally, it starts by printing out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Removing broadcast [broadcastId]\n

It then requests all the BroadcastBlockId objects that belong to the broadcastId broadcast from the BlockInfoManager and removes them.

                                                                                                                                                                                                                                                                                                                                                      The number of blocks removed is the final result.

NOTE: removeBroadcast is used by BlockManagerSlaveEndpoint while handling RemoveBroadcast messages.
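Again for illustration only (user-facing code, not BlockManager internals): destroying a broadcast variable is what ultimately drives removeBroadcast on the BlockManagers that hold its blocks. A minimal sketch, assuming a local Spark setup:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: create, use and destroy a broadcast variable; destroying it
// requests removal of the broadcast blocks on the driver and the executors.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("remove-broadcast-demo"))
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
sc.parallelize(Seq("a", "b")).map(lookup.value).collect()
lookup.destroy()   // removes all the blocks of the broadcast variable
sc.stop()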

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#external-shuffle-servers-address","title":"External Shuffle Server's Address
                                                                                                                                                                                                                                                                                                                                                      shuffleServerId: BlockManagerId\n

When requested to initialize, BlockManager records the location (BlockManagerId) of the External Shuffle Service (if enabled) or simply uses its own (non-external-shuffle-service) BlockManagerId.

                                                                                                                                                                                                                                                                                                                                                      The BlockManagerId is used to register an executor with a local external shuffle service.

                                                                                                                                                                                                                                                                                                                                                      The BlockManagerId is used as the location of a shuffle map output when:

                                                                                                                                                                                                                                                                                                                                                      • BypassMergeSortShuffleWriter is requested to write partition records to a shuffle file
                                                                                                                                                                                                                                                                                                                                                      • UnsafeShuffleWriter is requested to close and write output
                                                                                                                                                                                                                                                                                                                                                      • SortShuffleWriter is requested to write output
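As a hedged configuration sketch (the property names are standard Spark settings, but the values are only an example): the external shuffle service that shuffleServerId points at is turned on with spark.shuffle.service.enabled.

import org.apache.spark.SparkConf

// Sketch only: with the external shuffle service enabled, BlockManager.initialize
// records the service's address as shuffleServerId instead of its own BlockManagerId.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")  // use the external shuffle service
  .set("spark.shuffle.service.port", "7337")     // the service's port (default shown)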
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#getstatus","title":"getStatus
                                                                                                                                                                                                                                                                                                                                                      getStatus(\n  blockId: BlockId): Option[BlockStatus]\n

                                                                                                                                                                                                                                                                                                                                                      getStatus...FIXME

                                                                                                                                                                                                                                                                                                                                                      getStatus is used when BlockManagerSlaveEndpoint is requested to handle GetBlockStatus message.

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#re-registering-blockmanager-with-driver","title":"Re-registering BlockManager with Driver
                                                                                                                                                                                                                                                                                                                                                      reregister(): Unit\n

                                                                                                                                                                                                                                                                                                                                                      reregister prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      BlockManager [blockManagerId] re-registering with master\n

                                                                                                                                                                                                                                                                                                                                                      reregister requests the BlockManagerMaster to register this BlockManager.

In the end, reregister reports all the blocks (reportAllBlocks).

                                                                                                                                                                                                                                                                                                                                                      reregister is used when:

                                                                                                                                                                                                                                                                                                                                                      • Executor is requested to reportHeartBeat (and informed to re-register)
                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to asyncReregister
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#reporting-all-blocks","title":"Reporting All Blocks
                                                                                                                                                                                                                                                                                                                                                      reportAllBlocks(): Unit\n

                                                                                                                                                                                                                                                                                                                                                      reportAllBlocks prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Reporting [n] blocks to the master.\n

For every block in the BlockInfoManager, reportAllBlocks calculates the current block status (getCurrentBlockStatus) and, for blocks tracked by the master, reports it (tryToReportBlockStatus).

                                                                                                                                                                                                                                                                                                                                                      reportAllBlocks prints out the following ERROR message to the logs and exits when block status reporting fails for any block:

                                                                                                                                                                                                                                                                                                                                                      Failed to report [blockId] to master; giving up.\n
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#calculate-current-block-status","title":"Calculate Current Block Status
                                                                                                                                                                                                                                                                                                                                                      getCurrentBlockStatus(\n  blockId: BlockId,\n  info: BlockInfo): BlockStatus\n

getCurrentBlockStatus gives the current BlockStatus of the BlockId block (with the block's current StorageLevel, memory and disk sizes). It uses the MemoryStore and the DiskStore for size and other information.

NOTE: Most of the information to build BlockStatus is already in BlockInfo, except that it may not necessarily reflect the current state per the MemoryStore and the DiskStore.

Internally, it uses the input BlockInfo to know about the block's storage level. If the storage level is not set (i.e. null), the returned BlockStatus assumes the default NONE storage level and memory and disk sizes of 0.

If, however, the storage level is set, getCurrentBlockStatus uses the MemoryStore and the DiskStore to check whether the block is stored in either of the storages and requests their respective sizes (using getSize, or assuming 0 when the block is not there).

                                                                                                                                                                                                                                                                                                                                                      NOTE: It is acceptable that the BlockInfo says to use memory or disk yet the block is not in the storages (yet or anymore). The method will give current status.
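The decision just described can be sketched as follows. This is simplified pseudo-Scala written as if inside BlockManager (so Spark's internal types and the memoryStore and diskStore fields are assumed to be in scope), not the actual implementation:

// Simplified sketch of the logic described above (not the real code).
def currentBlockStatus(blockId: BlockId, info: BlockInfo): BlockStatus =
  info.level match {
    case null =>
      // No storage level recorded: NONE with zero memory and disk sizes.
      BlockStatus(StorageLevel.NONE, memSize = 0L, diskSize = 0L)
    case level =>
      val inMem  = level.useMemory && memoryStore.contains(blockId)
      val onDisk = level.useDisk && diskStore.contains(blockId)
      val memSize  = if (inMem)  memoryStore.getSize(blockId) else 0L
      val diskSize = if (onDisk) diskStore.getSize(blockId)   else 0L
      // The returned storage level reflects where the block actually is right now.
      val current = StorageLevel(onDisk, inMem, level.useOffHeap,
        inMem && level.deserialized, level.replication)
      BlockStatus(current, memSize, diskSize)
  }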

getCurrentBlockStatus is used when reporting all blocks to the master (reportAllBlocks) and when storing a block locally (putBody), among other places.","text":""},{"location":"storage/BlockManager/#reporting-current-storage-status-of-block-to-driver","title":"Reporting Current Storage Status of Block to Driver

                                                                                                                                                                                                                                                                                                                                                      reportBlockStatus(\n  blockId: BlockId,\n  status: BlockStatus,\n  droppedMemorySize: Long = 0L): Unit\n

reportBlockStatus reports the block status to the master (tryToReportBlockStatus).

                                                                                                                                                                                                                                                                                                                                                      If told to re-register, reportBlockStatus prints out the following INFO message to the logs followed by asynchronous re-registration:

                                                                                                                                                                                                                                                                                                                                                      Got told to re-register updating block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                      In the end, reportBlockStatus prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                      Told master about block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                      reportBlockStatus is used when:

                                                                                                                                                                                                                                                                                                                                                      • IndexShuffleBlockResolver is requested to
                                                                                                                                                                                                                                                                                                                                                      • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to getLocalBlockData, doPutIterator, dropFromMemory, removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#reporting-block-status-update-to-driver","title":"Reporting Block Status Update to Driver
                                                                                                                                                                                                                                                                                                                                                      tryToReportBlockStatus(\n  blockId: BlockId,\n  status: BlockStatus,\n  droppedMemorySize: Long = 0L): Boolean\n

                                                                                                                                                                                                                                                                                                                                                      tryToReportBlockStatus reports block status update to the BlockManagerMaster and returns its response.

                                                                                                                                                                                                                                                                                                                                                      tryToReportBlockStatus is used when:

                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to reportAllBlocks, reportBlockStatus
                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#execution-context","title":"Execution Context

                                                                                                                                                                                                                                                                                                                                                      block-manager-future is the execution context for...FIXME

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#bytebuffer","title":"ByteBuffer

The underlying abstraction for blocks in Spark is a ByteBuffer that limits the size of a block to 2GB (Integer.MAX_VALUE - see Why does FileChannel.map take up to Integer.MAX_VALUE of data? and SPARK-1476 2GB limit in spark for blocks). This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB even though the API allows for long) and for serialization/deserialization via byte array-backed output streams.
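A quick arithmetic check of where the 2GB figure comes from (illustration only):

// Integer.MAX_VALUE bytes is the hard ceiling for a single ByteBuffer-backed block.
val maxBlockBytes: Long = Int.MaxValue             // 2147483647 bytes
val maxBlockGiB = maxBlockBytes / math.pow(1024, 3)
println(f"$maxBlockGiB%.2f GiB")                   // ~2.00 GiB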

                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/BlockManager/#blockresult","title":"BlockResult

BlockResult is the metadata of a fetched block (a sketch of its shape follows the lists below):

                                                                                                                                                                                                                                                                                                                                                      • Data (Iterator[Any])
                                                                                                                                                                                                                                                                                                                                                      • DataReadMethod
                                                                                                                                                                                                                                                                                                                                                      • Size (bytes)

                                                                                                                                                                                                                                                                                                                                                        BlockResult is created and returned when BlockManager is requested for the following:

                                                                                                                                                                                                                                                                                                                                                        • getOrElseUpdate
                                                                                                                                                                                                                                                                                                                                                        • get
                                                                                                                                                                                                                                                                                                                                                        • getLocalValues
                                                                                                                                                                                                                                                                                                                                                        • getRemoteValues
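Based only on the fields listed above, BlockResult's shape can be thought of roughly as follows (a sketch, not necessarily the exact declaration):

import org.apache.spark.executor.DataReadMethod

// Sketch of BlockResult's shape, based on the fields above.
class BlockResult(
  val data: Iterator[Any],                // the fetched values
  val readMethod: DataReadMethod.Value,   // how the data was read (Memory, Disk, ...)
  val bytes: Long)                        // size of the block (bytes)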
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#datareadmethod","title":"DataReadMethod

                                                                                                                                                                                                                                                                                                                                                        DataReadMethod describes how block data was read.

DataReadMethod | Source
Disk | DiskStore (while getLocalValues)
Hadoop | seems unused
Memory | MemoryStore (while getLocalValues)
Network | Remote BlockManagers (aka network)
","text":""},{"location":"storage/BlockManager/#registering-task","title":"Registering Task
                                                                                                                                                                                                                                                                                                                                                        registerTask(\n  taskAttemptId: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                        registerTask requests the BlockInfoManager to register a given task.

                                                                                                                                                                                                                                                                                                                                                        registerTask is used when Task is requested to run (at the start of a task).

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#creating-diskblockobjectwriter","title":"Creating DiskBlockObjectWriter
                                                                                                                                                                                                                                                                                                                                                        getDiskWriter(\n  blockId: BlockId,\n  file: File,\n  serializerInstance: SerializerInstance,\n  bufferSize: Int,\n  writeMetrics: ShuffleWriteMetrics): DiskBlockObjectWriter\n

                                                                                                                                                                                                                                                                                                                                                        getDiskWriter creates a DiskBlockObjectWriter (with spark.shuffle.sync configuration property for syncWrites argument).

                                                                                                                                                                                                                                                                                                                                                        getDiskWriter uses the SerializerManager.
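A hedged configuration sketch (the value is only an example): the syncWrites argument mentioned above comes from the spark.shuffle.sync configuration property, which can be set through SparkConf:

import org.apache.spark.SparkConf

// Sketch only: make DiskBlockObjectWriter sync its writes to disk
// (the syncWrites argument of getDiskWriter).
val conf = new SparkConf().set("spark.shuffle.sync", "true")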

                                                                                                                                                                                                                                                                                                                                                        getDiskWriter is used when:

                                                                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter is requested to write records (of a partition)

                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is requested to writeSortedFile

                                                                                                                                                                                                                                                                                                                                                        • ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk

                                                                                                                                                                                                                                                                                                                                                        • ExternalSorter is requested to spillMemoryIteratorToDisk and writePartitionedFile

                                                                                                                                                                                                                                                                                                                                                        • UnsafeSorterSpillWriter is created

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#recording-updated-blockstatus-in-taskmetrics-of-current-task","title":"Recording Updated BlockStatus in TaskMetrics (of Current Task)
                                                                                                                                                                                                                                                                                                                                                        addUpdatedBlockStatusToTaskMetrics(\n  blockId: BlockId,\n  status: BlockStatus): Unit\n

addUpdatedBlockStatusToTaskMetrics takes the active TaskContext (if available) and records the updated BlockStatus of the block (in the task's TaskMetrics).

addUpdatedBlockStatusToTaskMetrics is used when BlockManager is requested to doPutBytes (for a block that was successfully stored), doPut, doPutIterator, drop a block from memory (possibly spilling it to disk), and remove a block from memory and disk.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#shuffle-metrics-source","title":"Shuffle Metrics Source
                                                                                                                                                                                                                                                                                                                                                        shuffleMetricsSource: Source\n

                                                                                                                                                                                                                                                                                                                                                        shuffleMetricsSource creates a ShuffleMetricsSource with the shuffleMetrics (of the BlockStoreClient) and the source name as follows:

                                                                                                                                                                                                                                                                                                                                                        • ExternalShuffle when ExternalBlockStoreClient is specified
                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransfer otherwise

                                                                                                                                                                                                                                                                                                                                                        shuffleMetricsSource is available using SparkEnv:

                                                                                                                                                                                                                                                                                                                                                        env.blockManager.shuffleMetricsSource\n

                                                                                                                                                                                                                                                                                                                                                        shuffleMetricsSource is used when:

                                                                                                                                                                                                                                                                                                                                                        • Executor is created (for non-local / cluster modes)
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#replicating-block-to-peers","title":"Replicating Block To Peers
                                                                                                                                                                                                                                                                                                                                                        replicate(\n  blockId: BlockId,\n  data: BlockData,\n  level: StorageLevel,\n  classTag: ClassTag[_],\n  existingReplicas: Set[BlockManagerId] = Set.empty): Unit\n

                                                                                                                                                                                                                                                                                                                                                        replicate...FIXME

                                                                                                                                                                                                                                                                                                                                                        replicate is used when BlockManager is requested to doPutBytes, doPutIterator and replicateBlock.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#replicateblock","title":"replicateBlock
                                                                                                                                                                                                                                                                                                                                                        replicateBlock(\n  blockId: BlockId,\n  existingReplicas: Set[BlockManagerId],\n  maxReplicas: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                        replicateBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                        replicateBlock is used when BlockManagerSlaveEndpoint is requested to handle a ReplicateBlock message.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#putiterator","title":"putIterator
                                                                                                                                                                                                                                                                                                                                                        putIterator[T: ClassTag](\n  blockId: BlockId,\n  values: Iterator[T],\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                        putIterator...FIXME

                                                                                                                                                                                                                                                                                                                                                        putIterator is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to putSingle
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#putsingle","title":"putSingle
                                                                                                                                                                                                                                                                                                                                                        putSingle[T: ClassTag](\n  blockId: BlockId,\n  value: T,\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                        putSingle...FIXME

                                                                                                                                                                                                                                                                                                                                                        putSingle is used when TorrentBroadcast is requested to write the blocks and readBroadcastBlock.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#doputiterator","title":"doPutIterator
                                                                                                                                                                                                                                                                                                                                                        doPutIterator[T](\n  blockId: BlockId,\n  iterator: () => Iterator[T],\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  tellMaster: Boolean = true,\n  keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]]\n

doPutIterator uses doPut with the putBody function (described below).

                                                                                                                                                                                                                                                                                                                                                        doPutIterator is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getOrElseUpdate and putIterator
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#putbody","title":"putBody
                                                                                                                                                                                                                                                                                                                                                        putBody: BlockInfo => Option[T]\n

When the given StorageLevel indicates to use memory for storage, putBody requests the MemoryStore to putIteratorAsValues (for a deserialized storage level) or putIteratorAsBytes (for a serialized one).

                                                                                                                                                                                                                                                                                                                                                        In case storing the block in memory was not possible (due to lack of available memory), putBody prints out the following WARN message to the logs and falls back on the DiskStore to store the block.

                                                                                                                                                                                                                                                                                                                                                        Persisting block [blockId] to disk instead.\n

                                                                                                                                                                                                                                                                                                                                                        For the given StorageLevel that indicates to use disk storage only (useMemory flag is disabled), putBody requests the DiskStore to store the block.

                                                                                                                                                                                                                                                                                                                                                        putBody gets the current block status and checks whether the StorageLevel is valid (that indicates that the block was stored successfully).

                                                                                                                                                                                                                                                                                                                                                        If the block was stored successfully, putBody reports the block status (only if indicated by the the given tellMaster flag and the tellMaster flag of the associated BlockInfo) and addUpdatedBlockStatusToTaskMetrics.

                                                                                                                                                                                                                                                                                                                                                        putBody prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                        Put block [blockId] locally took [duration] ms\n

                                                                                                                                                                                                                                                                                                                                                        For the given StorageLevel with replication enabled (above 1), putBody doGetLocalBytes and replicates the block (to other BlockManagers). putBody prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                        Put block [blockId] remotely took [duration] ms\n
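A minimal sketch of that memory-first, disk-fallback decision (an illustration only, with stand-in functions for the MemoryStore and DiskStore calls; not the actual putBody code):

import org.apache.spark.storage.StorageLevel

// Hedged sketch of the putBody decision tree. putInMemory stands for
// MemoryStore.putIteratorAsValues/putIteratorAsBytes (returns true when the
// whole block fits in memory); putOnDisk stands for DiskStore.put.
def storeBlock(
    blockId: String,
    level: StorageLevel,
    putInMemory: () => Boolean,
    putOnDisk: () => Unit): Unit = {
  if (level.useMemory) {
    val keptInMemory = putInMemory()
    if (!keptInMemory && level.useDisk) {
      println(s"Persisting block $blockId to disk instead.")
      putOnDisk()
    }
  } else if (level.useDisk) {
    putOnDisk()
  }
}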
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#dogetlocalbytes","title":"doGetLocalBytes
doGetLocalBytes(
  blockId: BlockId,
  info: BlockInfo): BlockData

                                                                                                                                                                                                                                                                                                                                                        doGetLocalBytes...FIXME

doGetLocalBytes is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getLocalBytes, doPutIterator and replicateBlock
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#dropping-block-from-memory","title":"Dropping Block from Memory
dropFromMemory(
  blockId: BlockId,
  data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel

dropFromMemory prints out the following INFO message to the logs:

Dropping block [blockId] from memory

dropFromMemory requests the BlockInfoManager to assert that the block is locked for writing (which gives a BlockInfo or throws a SparkException).

dropFromMemory drops the block to disk if the current storage level requires it (based on the given BlockInfo) and the block is not in the DiskStore already. dropFromMemory prints out the following INFO message to the logs:

Writing block [blockId] to disk

dropFromMemory uses the given data to determine whether the DiskStore is requested to put or putBytes (for Array[T] or ChunkedByteBuffer, respectively).

dropFromMemory requests the MemoryStore to remove the block. dropFromMemory prints out the following WARN message to the logs if the block was not found in the MemoryStore:

Block [blockId] could not be dropped from memory as it does not exist

dropFromMemory gets the current block status and reports it (reportBlockStatus) when the tellMaster flag of the BlockInfo is turned on.

dropFromMemory registers the change with addUpdatedBlockStatusToTaskMetrics when the block has been updated (dropped to disk or removed from the MemoryStore).

In the end, dropFromMemory returns the current StorageLevel of the block (from the BlockStatus).

                                                                                                                                                                                                                                                                                                                                                        dropFromMemory is part of the BlockEvictionHandler abstraction.
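A hedged sketch of how the shape of data selects the DiskStore operation (stand-in functions, not the real DiskStore API):

import org.apache.spark.util.io.ChunkedByteBuffer

// Illustration only: Left(values) maps to DiskStore.put (the values still have to be
// serialized), Right(bytes) maps to DiskStore.putBytes (already-serialized bytes).
def writeDroppedBlockToDisk[T](
    data: () => Either[Array[T], ChunkedByteBuffer],
    putValues: Array[T] => Unit,
    putBytes: ChunkedByteBuffer => Unit): Unit =
  data() match {
    case Left(values) => putValues(values)
    case Right(bytes) => putBytes(bytes)
  }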

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#releaselock-method","title":"releaseLock Method
releaseLock(
  blockId: BlockId,
  taskAttemptId: Option[Long] = None): Unit

                                                                                                                                                                                                                                                                                                                                                        releaseLock requests the BlockInfoManager to unlock the given block.

                                                                                                                                                                                                                                                                                                                                                        releaseLock is part of the BlockDataManager abstraction.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#putblockdataasstream","title":"putBlockDataAsStream
putBlockDataAsStream(
  blockId: BlockId,
  level: StorageLevel,
  classTag: ClassTag[_]): StreamCallbackWithID

                                                                                                                                                                                                                                                                                                                                                        putBlockDataAsStream is part of the BlockDataManager abstraction.

                                                                                                                                                                                                                                                                                                                                                        putBlockDataAsStream...FIXME

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#maximum-memory","title":"Maximum Memory

The total maximum memory that BlockManager can ever use for storage, i.e. the total available on-heap and off-heap memory for storage (in bytes). The value depends on the MemoryManager and may vary over time.
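As a hedged sketch of the arithmetic (the parameter names mirror the MemoryManager's on-heap/off-heap storage memory split and are assumptions here):

// Total storage memory available to the BlockManager: on-heap plus off-heap,
// both reported (and possibly varied over time) by the MemoryManager.
def maxMemory(maxOnHeapStorageMemory: Long, maxOffHeapStorageMemory: Long): Long =
  maxOnHeapStorageMemory + maxOffHeapStorageMemory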

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#maximum-off-heap-memory","title":"Maximum Off-Heap Memory","text":""},{"location":"storage/BlockManager/#maximum-on-heap-memory","title":"Maximum On-Heap Memory","text":""},{"location":"storage/BlockManager/#decommissionself","title":"decommissionSelf
decommissionSelf(): Unit

                                                                                                                                                                                                                                                                                                                                                        decommissionSelf...FIXME

                                                                                                                                                                                                                                                                                                                                                        decommissionSelf is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManagerStorageEndpoint is requested to handle a DecommissionBlockManager message
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#decommissionblockmanager","title":"decommissionBlockManager
decommissionBlockManager(): Unit

                                                                                                                                                                                                                                                                                                                                                        decommissionBlockManager sends a DecommissionBlockManager message to the BlockManagerStorageEndpoint.

                                                                                                                                                                                                                                                                                                                                                        decommissionBlockManager is used when:

                                                                                                                                                                                                                                                                                                                                                        • CoarseGrainedExecutorBackend is requested to decommissionSelf
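A hedged sketch of the hand-off (askStorageEndpoint stands for sending to the BlockManagerStorageEndpoint; DecommissionBlockManager is modeled here as a plain object, an assumption):

// Illustration only: decommissioning is triggered by handing a message to the
// block manager's own storage RPC endpoint, which then invokes decommissionSelf.
case object DecommissionBlockManager

def decommissionBlockManagerSketch(askStorageEndpoint: Any => Unit): Unit =
  askStorageEndpoint(DecommissionBlockManager)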
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#blockmanagerstorageendpoint","title":"BlockManagerStorageEndpoint
storageEndpoint: RpcEndpointRef

BlockManager sets up an RpcEndpointRef (within the RpcEnv) under the name BlockManagerEndpoint[ID] with a BlockManagerStorageEndpoint message handler.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#blockmanagerdecommissioner","title":"BlockManagerDecommissioner
decommissioner: Option[BlockManagerDecommissioner]

                                                                                                                                                                                                                                                                                                                                                        BlockManager defines decommissioner internal registry for a BlockManagerDecommissioner.

                                                                                                                                                                                                                                                                                                                                                        decommissioner is undefined (None) by default.

                                                                                                                                                                                                                                                                                                                                                        BlockManager creates and starts a BlockManagerDecommissioner when requested to decommissionSelf.

                                                                                                                                                                                                                                                                                                                                                        decommissioner is used for isDecommissioning and lastMigrationInfo.

                                                                                                                                                                                                                                                                                                                                                        BlockManager requests the BlockManagerDecommissioner to stop when stopped.
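A hedged sketch of that create-once / start / stop lifecycle (Decommissioner below is a stand-in trait, not the real BlockManagerDecommissioner; the synchronization is an assumption):

// Illustration of the decommissioner registry lifecycle described above.
trait Decommissioner { def start(): Unit; def stop(): Unit }

class DecommissionLifecycle(newDecommissioner: () => Decommissioner) {
  // Undefined (None) by default
  @volatile private var decommissioner: Option[Decommissioner] = None

  // Created and started the first time decommissioning is requested
  def decommissionSelf(): Unit = synchronized {
    if (decommissioner.isEmpty) {
      decommissioner = Some(newDecommissioner())
      decommissioner.foreach(_.start())
    }
  }

  def isDecommissioning: Boolean = decommissioner.isDefined

  // Stopped when the owning BlockManager stops
  def stop(): Unit = decommissioner.foreach(_.stop())
}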

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#removing-block-from-memory-and-disk","title":"Removing Block from Memory and Disk
removeBlockInternal(
  blockId: BlockId,
  tellMaster: Boolean): Unit

With the tellMaster flag turned on, removeBlockInternal requests the BlockInfoManager to assert that the block is locked for writing and remembers the current block status. Otherwise, removeBlockInternal leaves the block status undetermined.

                                                                                                                                                                                                                                                                                                                                                        removeBlockInternal requests the MemoryStore to remove the block.

                                                                                                                                                                                                                                                                                                                                                        removeBlockInternal requests the DiskStore to remove the block.

                                                                                                                                                                                                                                                                                                                                                        removeBlockInternal requests the BlockInfoManager to remove the block metadata.

                                                                                                                                                                                                                                                                                                                                                        In the end, removeBlockInternal reports the block status (to the master) with the storage level changed to NONE.

removeBlockInternal prints out the following WARN message when the block was found in neither the MemoryStore nor the DiskStore:

Block [blockId] could not be removed as it was not found on disk or in memory

                                                                                                                                                                                                                                                                                                                                                        removeBlockInternal is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to put a new block and remove a block
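A hedged sketch of the sequence of steps above (stand-in functions, not the real MemoryStore / DiskStore / BlockInfoManager APIs):

// Illustration only: remove the block everywhere, warn if it was nowhere,
// drop its metadata and, if requested, report the removal to the master.
def removeBlockInternalSketch(
    blockId: String,
    tellMaster: Boolean,
    removeFromMemory: () => Boolean,   // MemoryStore removal
    removeFromDisk: () => Boolean,     // DiskStore removal
    removeMetadata: () => Unit,        // BlockInfoManager removal
    reportToMaster: () => Unit         // block status with StorageLevel.NONE
): Unit = {
  val wasInMemory = removeFromMemory()
  val wasOnDisk = removeFromDisk()
  if (!wasInMemory && !wasOnDisk) {
    println(s"Block $blockId could not be removed as it was not found on disk or in memory")
  }
  removeMetadata()
  if (tellMaster) reportToMaster()
}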
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#maybecachediskbytesinmemory","title":"maybeCacheDiskBytesInMemory
maybeCacheDiskBytesInMemory(
  blockInfo: BlockInfo,
  blockId: BlockId,
  level: StorageLevel,
  diskData: BlockData): Option[ChunkedByteBuffer]

                                                                                                                                                                                                                                                                                                                                                        maybeCacheDiskBytesInMemory...FIXME

                                                                                                                                                                                                                                                                                                                                                        maybeCacheDiskBytesInMemory is used when:

                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getLocalValues and doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.storage.BlockManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.storage.BlockManager=ALL

                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerDecommissioner/","title":"BlockManagerDecommissioner","text":"

                                                                                                                                                                                                                                                                                                                                                        BlockManagerDecommissioner is a decommissioning process used by BlockManager.

                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/BlockManagerDecommissioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                        BlockManagerDecommissioner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                        • SparkConf
                                                                                                                                                                                                                                                                                                                                                        • BlockManager

BlockManagerDecommissioner is created when:

                                                                                                                                                                                                                                                                                                                                                          • BlockManager is requested to decommissionSelf
                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/BlockManagerId/","title":"BlockManagerId","text":"

                                                                                                                                                                                                                                                                                                                                                          BlockManagerId is a unique identifier (address) of a BlockManager.

                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/BlockManagerInfo/","title":"BlockManagerInfo","text":"

                                                                                                                                                                                                                                                                                                                                                          BlockManagerInfo is...FIXME

                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/BlockManagerMaster/","title":"BlockManagerMaster","text":"

                                                                                                                                                                                                                                                                                                                                                          BlockManagerMaster runs on the driver and executors to exchange block metadata (status and locations) in a Spark application.

BlockManagerMaster uses the BlockManagerMasterEndpoint (registered as the BlockManagerMaster RPC endpoint on the driver, with endpoint references on executors) so that executors can send block status updates and the driver can keep track of block status and locations.

                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/BlockManagerMaster/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                          BlockManagerMaster takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                          • Driver Endpoint
                                                                                                                                                                                                                                                                                                                                                          • Heartbeat Endpoint
                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                          • isDriver flag (whether it is created for the driver or executors)

BlockManagerMaster is created when:

                                                                                                                                                                                                                                                                                                                                                            • SparkEnv utility is used to create a SparkEnv (and create a BlockManager)
                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerMaster/#driver-endpoint","title":"Driver Endpoint

BlockManagerMaster is given an RpcEndpointRef of the BlockManagerMaster RPC Endpoint (on the driver) when created.

                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#heartbeat-endpoint","title":"Heartbeat Endpoint

BlockManagerMaster is given an RpcEndpointRef of the BlockManagerMasterHeartbeat RPC Endpoint (on the driver) when created.

                                                                                                                                                                                                                                                                                                                                                            The endpoint is used (mainly) when:

                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to executorHeartbeatReceived
                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#registering-blockmanager-on-executor-with-driver","title":"Registering BlockManager (on Executor) with Driver
registerBlockManager(
  id: BlockManagerId,
  localDirs: Array[String],
  maxOnHeapMemSize: Long,
  maxOffHeapMemSize: Long,
  storageEndpoint: RpcEndpointRef): BlockManagerId

                                                                                                                                                                                                                                                                                                                                                            registerBlockManager prints out the following INFO message to the logs (with the given BlockManagerId):

Registering BlockManager [id]

                                                                                                                                                                                                                                                                                                                                                            registerBlockManager notifies the driver (using the BlockManagerMaster RPC endpoint) that the BlockManagerId wants to register (and sends a blocking RegisterBlockManager message).

                                                                                                                                                                                                                                                                                                                                                            Note

The input maxOnHeapMemSize and maxOffHeapMemSize are the total available on-heap and off-heap memory, respectively, for storage on the BlockManager.

                                                                                                                                                                                                                                                                                                                                                            registerBlockManager waits until a confirmation comes (as a possibly-updated BlockManagerId).

                                                                                                                                                                                                                                                                                                                                                            In the end, registerBlockManager prints out the following INFO message to the logs and returns the BlockManagerId received.

Registered BlockManager [updatedId]

registerBlockManager is used when:

                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to initialize and reregister
                                                                                                                                                                                                                                                                                                                                                            • FallbackStorage utility is used to registerBlockManagerIfNeeded
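A hedged sketch of the blocking round trip (askSync below is a stand-in for the blocking ask on the driver endpoint; the message value is left abstract):

// Illustration only: log, block on the driver's answer, log the (possibly updated)
// BlockManagerId and return it.
def registerBlockManagerSketch[Id](askSync: AnyRef => Id, registerMessage: AnyRef): Id = {
  println("Registering BlockManager")
  val updatedId = askSync(registerMessage)   // blocks until the driver confirms
  println(s"Registered BlockManager $updatedId")
  updatedId
}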
                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#finding-block-locations-for-single-block","title":"Finding Block Locations for Single Block
getLocations(
  blockId: BlockId): Seq[BlockManagerId]

                                                                                                                                                                                                                                                                                                                                                            getLocations requests the driver (using the BlockManagerMaster RPC endpoint) for BlockManagerIds of the given BlockId (and sends a blocking GetLocations message).

getLocations is used when:

                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to fetchRemoteManagedBuffer
• BlockManagerMaster is requested to check whether it contains a BlockId
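A hedged sketch of the lookup (GetLocations is modeled here as a plain case class and askSync stands for the blocking ask on the driver endpoint; both are assumptions):

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Illustration only: ask the driver which BlockManagers hold the block.
final case class GetLocations(blockId: BlockId)

def locationsOf(
    askSync: GetLocations => Seq[BlockManagerId])(
    blockId: BlockId): Seq[BlockManagerId] =
  askSync(GetLocations(blockId))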
                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#finding-block-locations-for-multiple-blocks","title":"Finding Block Locations for Multiple Blocks
getLocations(
  blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]]

                                                                                                                                                                                                                                                                                                                                                            getLocations requests the driver (using the BlockManagerMaster RPC endpoint) for BlockManagerIds of the given BlockIds (and sends a blocking GetLocationsMultipleBlockIds message).

getLocations is used when:

                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested for BlockManagers (executors) for cached RDD partitions
                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to getLocationBlockIds
                                                                                                                                                                                                                                                                                                                                                            • BlockManager utility is used to blockIdsToLocations
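
A similar sketch for the bulk variant, under the same assumptions (the block ids are made up); the result holds one Seq[BlockManagerId] per requested BlockId:

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{BlockId, BlockManagerId, RDDBlockId}

// Sketch only: two hypothetical partitions of RDD 0.
val blockIds: Array[BlockId] = Array(RDDBlockId(0, 0), RDDBlockId(0, 1))

val locationsPerBlock: IndexedSeq[Seq[BlockManagerId]] =
  SparkEnv.get.blockManager.master.getLocations(blockIds)

blockIds.zip(locationsPerBlock).foreach { case (id, locs) =>
  println(s"$id -> ${locs.map(_.executorId).mkString(", ")}")
}
```
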
                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#contains","title":"contains
                                                                                                                                                                                                                                                                                                                                                            contains(\n  blockId: BlockId): Boolean\n

                                                                                                                                                                                                                                                                                                                                                            contains is positive (true) when there is at least one executor with the given BlockId.

                                                                                                                                                                                                                                                                                                                                                            contains\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                            • LocalRDDCheckpointData is requested to doCheckpoint
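
Since the check only needs to know whether any executor reports the block, it can be thought of as a thin wrapper around getLocations. A non-authoritative sketch of that relationship (not necessarily the literal implementation):

```scala
import org.apache.spark.storage.{BlockId, BlockManagerMaster}

// True when at least one BlockManager reports a location for the block.
def containsBlock(master: BlockManagerMaster, blockId: BlockId): Boolean =
  master.getLocations(blockId).nonEmpty
```
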
                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMaster/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.storage.BlockManagerMaster logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.storage.BlockManagerMaster=ALL\n

                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.
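
On Spark builds that still bundle Log4j 1.x, the same logger can also be raised programmatically (e.g. from spark-shell); treat this as a convenience sketch, not the documented configuration path:

```scala
import org.apache.log4j.{Level, Logger}

// Raise the BlockManagerMaster logger to ALL at runtime (Log4j 1.x API;
// Spark 3.3+ ships Log4j 2 and is configured via log4j2.properties instead).
Logger.getLogger("org.apache.spark.storage.BlockManagerMaster")
  .setLevel(Level.ALL)
```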

                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockManagerMasterEndpoint/","title":"BlockManagerMasterEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                            BlockManagerMasterEndpoint is a rpc:RpcEndpoint.md#ThreadSafeRpcEndpoint[ThreadSafeRpcEndpoint] for storage:BlockManagerMaster.md[BlockManagerMaster].

                                                                                                                                                                                                                                                                                                                                                            BlockManagerMasterEndpoint is registered under BlockManagerMaster name.

                                                                                                                                                                                                                                                                                                                                                            BlockManagerMasterEndpoint tracks status of the storage:BlockManager.md[BlockManagers] (on the executors) in a Spark application.

== [[creating-instance]] Creating Instance

BlockManagerMasterEndpoint takes the following to be created:

• [[rpcEnv]] rpc:RpcEnv.md[]
• [[isLocal]] Flag whether BlockManagerMasterEndpoint works in local or cluster mode
• [[conf]] SparkConf.md[]
• [[listenerBus]] scheduler:LiveListenerBus.md[]

BlockManagerMasterEndpoint is created for the core:SparkEnv.md#create[SparkEnv] on the driver (to create a storage:BlockManagerMaster.md[] for a storage:BlockManager.md#master[BlockManager]).

When created, BlockManagerMasterEndpoint prints out the following INFO message to the logs:

[source,plaintext]
----
BlockManagerMasterEndpoint up
----

== [[messages]][[receiveAndReply]] Messages

As an rpc:RpcEndpoint.md[], BlockManagerMasterEndpoint handles RPC messages.

=== [[BlockManagerHeartbeat]] BlockManagerHeartbeat

[source, scala]
----
BlockManagerHeartbeat(
  blockManagerId: BlockManagerId)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetLocations]] GetLocations

[source, scala]
----
GetLocations(
  blockId: BlockId)
----

When received, BlockManagerMasterEndpoint replies with the block locations of blockId.

Posted when BlockManagerMaster.md#getLocations-block[BlockManagerMaster requests the block locations of a single block].
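
The exchange itself is a plain blocking ask on the driver endpoint. A sketch of the sender side, assuming a RpcEndpointRef to BlockManagerMasterEndpoint and code living in the org.apache.spark namespace (the RPC and message types are private[spark]):

[source, scala]
----
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.storage.{BlockId, BlockManagerId}
import org.apache.spark.storage.BlockManagerMessages.GetLocations

// Sketch only: driverEndpoint stands for the RpcEndpointRef of
// BlockManagerMasterEndpoint (the one BlockManagerMaster holds).
def askForLocations(
    driverEndpoint: RpcEndpointRef,
    blockId: BlockId): Seq[BlockManagerId] =
  driverEndpoint.askSync[Seq[BlockManagerId]](GetLocations(blockId))
----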

=== [[GetLocationsAndStatus]] GetLocationsAndStatus

[source, scala]
----
GetLocationsAndStatus(
  blockId: BlockId)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetLocationsMultipleBlockIds]] GetLocationsMultipleBlockIds

[source, scala]
----
GetLocationsMultipleBlockIds(
  blockIds: Array[BlockId])
----

When received, BlockManagerMasterEndpoint replies with the block locations for the given storage:BlockId.md[BlockIds].

Posted when BlockManagerMaster.md#getLocations[BlockManagerMaster requests the block locations for multiple blocks].

=== [[GetPeers]] GetPeers

[source, scala]
----
GetPeers(
  blockManagerId: BlockManagerId)
----

When received, BlockManagerMasterEndpoint replies with the peers of blockManagerId.

Peers of a storage:BlockManager.md[BlockManager] are the other BlockManagers in a cluster (except the driver's BlockManager). Peers are used to know the available executors in a Spark application.

Posted when BlockManagerMaster.md#getPeers[BlockManagerMaster requests the peers of a BlockManager].
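
Peers are what block replication chooses targets from. A simplified sketch of asking for them and picking one at random, under the same assumptions as the previous sketch (real replication goes through a BlockReplicationPolicy):

[source, scala]
----
import scala.util.Random

import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.storage.BlockManagerId
import org.apache.spark.storage.BlockManagerMessages.GetPeers

// Sketch only: ask the master endpoint for the peers of this BlockManager
// and pick one as a (naive) replication target.
def pickReplicationTarget(
    driverEndpoint: RpcEndpointRef,
    self: BlockManagerId): Option[BlockManagerId] = {
  val peers = driverEndpoint.askSync[Seq[BlockManagerId]](GetPeers(self))
  Random.shuffle(peers).headOption
}
----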

=== [[GetExecutorEndpointRef]] GetExecutorEndpointRef

[source, scala]
----
GetExecutorEndpointRef(
  executorId: String)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetMemoryStatus]] GetMemoryStatus

[source, scala]
----
GetMemoryStatus
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetStorageStatus]] GetStorageStatus

[source, scala]
----
GetStorageStatus
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetBlockStatus]] GetBlockStatus

[source, scala]
----
GetBlockStatus(
  blockId: BlockId,
  askSlaves: Boolean = true)
----

When received, BlockManagerMasterEndpoint is requested for the status of the given block.

Posted when...FIXME

=== [[GetMatchingBlockIds]] GetMatchingBlockIds

[source, scala]
----
GetMatchingBlockIds(
  filter: BlockId => Boolean,
  askSlaves: Boolean = true)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[HasCachedBlocks]] HasCachedBlocks

[source, scala]
----
HasCachedBlocks(
  executorId: String)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RegisterBlockManager]] RegisterBlockManager

[source,scala]
----
RegisterBlockManager(
  blockManagerId: BlockManagerId,
  maxOnHeapMemSize: Long,
  maxOffHeapMemSize: Long,
  sender: RpcEndpointRef)
----

When received, BlockManagerMasterEndpoint is requested to register the BlockManager (identified by the given storage:BlockManagerId.md[]).

Posted when BlockManagerMaster is requested to storage:BlockManagerMaster.md#registerBlockManager[register a BlockManager].

=== [[RemoveRdd]] RemoveRdd

[source, scala]
----
RemoveRdd(
  rddId: Int)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveShuffle]] RemoveShuffle

[source, scala]
----
RemoveShuffle(
  shuffleId: Int)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveBroadcast]] RemoveBroadcast

[source, scala]
----
RemoveBroadcast(
  broadcastId: Long,
  removeFromDriver: Boolean = true)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveBlock]] RemoveBlock

[source, scala]
----
RemoveBlock(
  blockId: BlockId)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveExecutor]] RemoveExecutor

[source, scala]
----
RemoveExecutor(
  execId: String)
----

When received, BlockManagerMasterEndpoint removes the executor with the given execId and sends true back in response.

Posted when BlockManagerMaster.md#removeExecutor[BlockManagerMaster removes an executor].

=== [[StopBlockManagerMaster]] StopBlockManagerMaster

[source, scala]
----
StopBlockManagerMaster
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[UpdateBlockInfo]] UpdateBlockInfo

[source, scala]
----
UpdateBlockInfo(
  blockManagerId: BlockManagerId,
  blockId: BlockId,
  storageLevel: StorageLevel,
  memSize: Long,
  diskSize: Long)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when BlockManagerMaster is requested to storage:BlockManagerMaster.md#updateBlockInfo[handle a block status update (from BlockManager on an executor)].

== [[storageStatus]] storageStatus Internal Method

[source,scala]
----
storageStatus: Array[StorageStatus]
----

storageStatus...FIXME

storageStatus is used when BlockManagerMasterEndpoint is requested to handle a GetStorageStatus message.

== [[getLocationsMultipleBlockIds]] getLocationsMultipleBlockIds Internal Method

[source,scala]
----
getLocationsMultipleBlockIds(
  blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]]
----

getLocationsMultipleBlockIds...FIXME

getLocationsMultipleBlockIds is used when BlockManagerMasterEndpoint is requested to handle a GetLocationsMultipleBlockIds message.

== [[removeShuffle]] removeShuffle Internal Method

[source,scala]
----
removeShuffle(
  shuffleId: Int): Future[Seq[Boolean]]
----

removeShuffle...FIXME

removeShuffle is used when BlockManagerMasterEndpoint is requested to handle a RemoveShuffle message.

                                                                                                                                                                                                                                                                                                                                                            == [[getPeers]] getPeers Internal Method

                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_18","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                            getPeers( blockManagerId: BlockManagerId): Seq[BlockManagerId]

                                                                                                                                                                                                                                                                                                                                                            getPeers finds all the registered BlockManagers (using <> internal registry) and checks if the input blockManagerId is amongst them.

                                                                                                                                                                                                                                                                                                                                                            If the input blockManagerId is registered, getPeers returns all the registered BlockManagers but the one on the driver and blockManagerId.

                                                                                                                                                                                                                                                                                                                                                            Otherwise, getPeers returns no BlockManagers.

                                                                                                                                                                                                                                                                                                                                                            NOTE: Peers of a storage:BlockManager.md[BlockManager] are the other BlockManagers in a cluster (except the driver's BlockManager). Peers are used to know the available executors in a Spark application.

                                                                                                                                                                                                                                                                                                                                                            getPeers is used when BlockManagerMasterEndpoint is requested to handle <> message.
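The filtering described above fits in a few lines. A minimal sketch with simplified stand-in types (not the actual Spark classes), where a plain Set stands in for the keys of the registry of registered BlockManagers:

[source, scala]
----
// Hypothetical stand-ins, for illustration only
case class BlockManagerId(executorId: String, host: String, port: Int) {
  // assumption: the driver's BlockManager is identified by a well-known executor ID
  def isDriver: Boolean = executorId == "driver"
}

def getPeers(
    registeredBlockManagers: Set[BlockManagerId],
    blockManagerId: BlockManagerId): Seq[BlockManagerId] =
  if (registeredBlockManagers.contains(blockManagerId)) {
    // all registered BlockManagers except the driver's and the one asking
    (registeredBlockManagers - blockManagerId).filterNot(_.isDriver).toSeq
  } else {
    // an unknown BlockManagerId gets no peers
    Seq.empty
  }
----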

                                                                                                                                                                                                                                                                                                                                                            == [[register]] register Internal Method

                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_19","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                            register( idWithoutTopologyInfo: BlockManagerId, maxOnHeapMemSize: Long, maxOffHeapMemSize: Long, slaveEndpoint: RpcEndpointRef): BlockManagerId

                                                                                                                                                                                                                                                                                                                                                            register registers a storage:BlockManager.md[] (based on the given storage:BlockManagerId.md[]) in the <> and <> registries and posts a SparkListenerBlockManagerAdded message (to the <>).

NOTE: The input maxOnHeapMemSize and maxOffHeapMemSize together give the storage:BlockManager.md#maxMemory[total available on-heap and off-heap memory for storage on a BlockManager].

                                                                                                                                                                                                                                                                                                                                                            NOTE: Registering a BlockManager can only happen once for an executor (identified by BlockManagerId.executorId in <> internal registry).

                                                                                                                                                                                                                                                                                                                                                            If another BlockManager has earlier been registered for the executor, you should see the following ERROR message in the logs:

                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerMasterEndpoint/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#got-two-different-block-manager-registrations-on-same-executor-will-replace-old-one-oldid-with-new-one-id","title":"Got two different block manager registrations on same executor - will replace old one [oldId] with new one [id]","text":"

                                                                                                                                                                                                                                                                                                                                                            And then <>.

                                                                                                                                                                                                                                                                                                                                                            register prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerMasterEndpoint/#sourceplaintext_2","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#registering-block-manager-hostport-with-bytes-ram-id","title":"Registering block manager [hostPort] with [bytes] RAM, [id]","text":"

                                                                                                                                                                                                                                                                                                                                                            The BlockManager is recorded in the internal registries:

                                                                                                                                                                                                                                                                                                                                                            • <>
                                                                                                                                                                                                                                                                                                                                                            • <>

                                                                                                                                                                                                                                                                                                                                                              In the end, register requests the <> to scheduler:LiveListenerBus.md#post[post] a SparkListener.md#SparkListenerBlockManagerAdded[SparkListenerBlockManagerAdded] message.

                                                                                                                                                                                                                                                                                                                                                              register is used when BlockManagerMasterEndpoint is requested to handle <> message.
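The registration flow above (register at most once per executor, replace a stale registration, record the BlockManager in both registries) can be sketched as follows. This is a simplified illustration with stand-in types and println in place of the logging and listener-bus calls, not the actual implementation:

[source, scala]
----
import scala.collection.mutable

// Hypothetical stand-ins, for illustration only
case class BlockManagerId(executorId: String, hostPort: String)
case class BlockManagerInfo(id: BlockManagerId, maxOnHeapMem: Long, maxOffHeapMem: Long)

val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]
val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, BlockManagerInfo]

def register(id: BlockManagerId, maxOnHeapMem: Long, maxOffHeapMem: Long): BlockManagerId = {
  if (!blockManagerInfo.contains(id)) {
    blockManagerIdByExecutor.get(id.executorId).foreach { oldId =>
      // a stale registration for the same executor is replaced
      println(s"ERROR Got two different block manager registrations on same executor " +
        s"- will replace old one $oldId with new one $id")
      blockManagerIdByExecutor -= oldId.executorId
      blockManagerInfo -= oldId
    }
    println(s"INFO Registering block manager ${id.hostPort} " +
      s"with ${maxOnHeapMem + maxOffHeapMem} RAM, $id")
    blockManagerIdByExecutor(id.executorId) = id
    blockManagerInfo(id) = BlockManagerInfo(id, maxOnHeapMem, maxOffHeapMem)
    // here the real endpoint would post a SparkListenerBlockManagerAdded to the listener bus
  }
  id
}
----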

                                                                                                                                                                                                                                                                                                                                                              == [[removeExecutor]] removeExecutor Internal Method

                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_20","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                              removeExecutor( execId: String): Unit

                                                                                                                                                                                                                                                                                                                                                              removeExecutor prints the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockManagerMasterEndpoint/#sourceplaintext_3","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#trying-to-remove-executor-execid-from-blockmanagermaster","title":"Trying to remove executor [execId] from BlockManagerMaster.","text":"

If the execId executor is registered (in the <> internal registry), removeExecutor <> (to remove its BlockManager).

                                                                                                                                                                                                                                                                                                                                                              removeExecutor is used when BlockManagerMasterEndpoint is requested to handle <> or <> messages.

                                                                                                                                                                                                                                                                                                                                                              == [[removeBlockManager]] removeBlockManager Internal Method

                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_21","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                              removeBlockManager( blockManagerId: BlockManagerId): Unit

                                                                                                                                                                                                                                                                                                                                                              removeBlockManager looks up blockManagerId and removes the executor it was working on from the internal registries:

                                                                                                                                                                                                                                                                                                                                                              • <>
                                                                                                                                                                                                                                                                                                                                                              • <>

It then goes over all the blocks of the BlockManager and removes the BlockManager from each block's locations in the blockLocations registry.

                                                                                                                                                                                                                                                                                                                                                                SparkListener.md#SparkListenerBlockManagerRemoved[SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId)] is posted to SparkContext.md#listenerBus[listenerBus].

                                                                                                                                                                                                                                                                                                                                                                You should then see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#sourceplaintext_4","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#removing-block-manager-blockmanagerid","title":"Removing block manager [blockManagerId]","text":"

                                                                                                                                                                                                                                                                                                                                                                removeBlockManager is used when BlockManagerMasterEndpoint is requested to <> (to handle <> or <> messages).
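A simplified sketch of the cleanup described above, again with stand-in types; for brevity it scans the whole blockLocations registry rather than only the blocks of the removed BlockManager:

[source, scala]
----
import scala.collection.mutable

// Hypothetical stand-ins, for illustration only
case class BlockId(name: String)
case class BlockManagerId(executorId: String, host: String, port: Int)

val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]
val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, AnyRef]
val blockLocations = mutable.HashMap.empty[BlockId, mutable.HashSet[BlockManagerId]]

def removeBlockManager(blockManagerId: BlockManagerId): Unit = {
  // drop the executor from both registries
  blockManagerIdByExecutor -= blockManagerId.executorId
  blockManagerInfo -= blockManagerId

  // forget this BlockManager as a location of every block it held
  blockLocations.keys.toList.foreach { blockId =>
    val locations = blockLocations(blockId)
    locations -= blockManagerId
    if (locations.isEmpty) blockLocations -= blockId
  }

  // here the real endpoint would post a SparkListenerBlockManagerRemoved to the listener bus
  println(s"INFO Removing block manager $blockManagerId")
}
----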

                                                                                                                                                                                                                                                                                                                                                                == [[getLocations]] getLocations Internal Method

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_22","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                getLocations( blockId: BlockId): Seq[BlockManagerId]

                                                                                                                                                                                                                                                                                                                                                                getLocations looks up the given storage:BlockId.md[] in the blockLocations internal registry and returns the locations (as a collection of BlockManagerId) or an empty collection.

                                                                                                                                                                                                                                                                                                                                                                getLocations is used when BlockManagerMasterEndpoint is requested to handle <> and <> messages.

                                                                                                                                                                                                                                                                                                                                                                == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.storage.BlockManagerMasterEndpoint logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#source","title":"[source]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#log4jloggerorgapachesparkstorageblockmanagermasterendpointall","title":"log4j.logger.org.apache.spark.storage.BlockManagerMasterEndpoint=ALL","text":"

                                                                                                                                                                                                                                                                                                                                                                Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                == [[internal-properties]] Internal Properties

                                                                                                                                                                                                                                                                                                                                                                === [[blockManagerIdByExecutor]] blockManagerIdByExecutor Lookup Table

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#sourcescala_4","title":"[source,scala]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#blockmanageridbyexecutor-mapstring-blockmanagerid","title":"blockManagerIdByExecutor: Map[String, BlockManagerId]","text":"

                                                                                                                                                                                                                                                                                                                                                                Lookup table of storage:BlockManagerId.md[]s by executor ID

                                                                                                                                                                                                                                                                                                                                                                A new executor is added when BlockManagerMasterEndpoint is requested to handle a <> message (and <>).

An executor is removed when BlockManagerMasterEndpoint is requested to handle <> and <> messages (via <>).

                                                                                                                                                                                                                                                                                                                                                                Used when BlockManagerMasterEndpoint is requested to handle <> message, <>, <> and <>.

                                                                                                                                                                                                                                                                                                                                                                === [[blockManagerInfo]] blockManagerInfo Lookup Table

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#sourcescala_5","title":"[source,scala]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#blockmanageridbyexecutor-mapstring-blockmanagerid_1","title":"blockManagerIdByExecutor: Map[String, BlockManagerId]","text":"

                                                                                                                                                                                                                                                                                                                                                                Lookup table of storage:BlockManagerInfo.md[] by storage:BlockManagerId.md[]

                                                                                                                                                                                                                                                                                                                                                                A new BlockManagerInfo is added when BlockManagerMasterEndpoint is requested to handle a <> message (and <>).

                                                                                                                                                                                                                                                                                                                                                                A BlockManagerInfo is removed when BlockManagerMasterEndpoint is requested to <> (to handle <> and <> messages).

                                                                                                                                                                                                                                                                                                                                                                === [[blockLocations]] blockLocations

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerMasterEndpoint/#sourcescala_6","title":"[source,scala]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#blocklocations-mapblockid-setblockmanagerid","title":"blockLocations: Map[BlockId, Set[BlockManagerId]]","text":"

                                                                                                                                                                                                                                                                                                                                                                Collection of storage:BlockId.md[] and their locations (as BlockManagerId).

Used in removeRdd to remove blocks for an RDD, removeBlockManager to remove blocks after a BlockManager gets removed, removeBlockFromWorkers, updateBlockInfo, and <>.

= BlockManagerMasterHeartbeatEndpoint

                                                                                                                                                                                                                                                                                                                                                                BlockManagerMasterHeartbeatEndpoint is...FIXME

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/","title":"BlockManagerSlaveEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSlaveEndpoint is a ThreadSafeRpcEndpoint for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSlaveEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                • [[rpcEnv]] rpc:RpcEnv.md[]
                                                                                                                                                                                                                                                                                                                                                                • [[blockManager]] Parent BlockManager.md[]
                                                                                                                                                                                                                                                                                                                                                                • [[mapOutputTracker]] scheduler:MapOutputTracker.md[]

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSlaveEndpoint is created for BlockManager.md#slaveEndpoint[BlockManager] (and registered under the name BlockManagerEndpoint[ID]).
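The registration itself is presumably a plain setupEndpoint call on the RpcEnv. A hedged sketch with hypothetical stand-ins for the RPC layer (the real types live in org.apache.spark.rpc and are internal to Spark):

[source, scala]
----
// Hypothetical stand-ins for the RPC layer, for illustration only
trait RpcEndpoint
trait RpcEndpointRef
trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
}

// rpcEnv, blockManager and mapOutputTracker constructor arguments omitted for brevity
class BlockManagerSlaveEndpoint extends RpcEndpoint

// the BlockManager registers its slave endpoint under a unique, ID-suffixed name
def registerSlaveEndpoint(rpcEnv: RpcEnv, id: Long): RpcEndpointRef =
  rpcEnv.setupEndpoint(s"BlockManagerEndpoint$id", new BlockManagerSlaveEndpoint)
----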

                                                                                                                                                                                                                                                                                                                                                                == [[messages]] Messages

                                                                                                                                                                                                                                                                                                                                                                === [[GetBlockStatus]] GetBlockStatus

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                GetBlockStatus( blockId: BlockId, askSlaves: Boolean = true)

When received, BlockManagerSlaveEndpoint requests the <> for the BlockManager.md#getStatus[status of the given block] (by BlockId.md[]) and sends it back to the sender.

                                                                                                                                                                                                                                                                                                                                                                Posted when...FIXME
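Conceptually, handling GetBlockStatus is a lookup-and-reply. A minimal sketch with stand-in types (not the actual Spark classes):

[source, scala]
----
// Hypothetical stand-ins, for illustration only
case class BlockId(name: String)
case class BlockStatus(memSize: Long, diskSize: Long)

trait BlockManager   { def getStatus(blockId: BlockId): Option[BlockStatus] }
trait RpcCallContext { def reply(response: Any): Unit }

// ask the parent BlockManager for the block's status and reply with the result
def handleGetBlockStatus(
    blockManager: BlockManager,
    context: RpcCallContext,
    blockId: BlockId): Unit =
  context.reply(blockManager.getStatus(blockId))
----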

                                                                                                                                                                                                                                                                                                                                                                === [[GetMatchingBlockIds]] GetMatchingBlockIds

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                GetMatchingBlockIds( filter: BlockId => Boolean, askSlaves: Boolean = true)

When received, BlockManagerSlaveEndpoint requests the <> to storage:BlockManager.md#getMatchingBlockIds[find IDs of existing blocks for a given filter] and sends them back to the sender.

                                                                                                                                                                                                                                                                                                                                                                Posted when...FIXME
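The interesting part of the message is the filter: an arbitrary predicate over BlockIds. A small usage example with hypothetical stand-in BlockId subtypes (not the actual Spark classes):

[source, scala]
----
// Hypothetical stand-ins, for illustration only
sealed trait BlockId
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
case class BroadcastBlockId(broadcastId: Long)     extends BlockId

case class GetMatchingBlockIds(filter: BlockId => Boolean, askSlaves: Boolean = true)

// e.g. ask only for the blocks of RDD 42
val msg = GetMatchingBlockIds({
  case RDDBlockId(rddId, _) => rddId == 42
  case _                    => false
})

// a receiving endpoint applies the filter to the block IDs it knows about
val known: Seq[BlockId] = Seq(RDDBlockId(42, 0), RDDBlockId(7, 1), BroadcastBlockId(1L))
val matching = known.filter(msg.filter)  // Seq(RDDBlockId(42, 0))
----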

                                                                                                                                                                                                                                                                                                                                                                === [[RemoveBlock]] RemoveBlock

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                RemoveBlock( blockId: BlockId)

                                                                                                                                                                                                                                                                                                                                                                When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerSlaveEndpoint/#removing-block-blockid","title":"removing block [blockId]","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSlaveEndpoint then <blockId block>>.

                                                                                                                                                                                                                                                                                                                                                                When the computation is successful, you should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                Done removing block [blockId], response is [response]\n

                                                                                                                                                                                                                                                                                                                                                                And true response is sent back. You should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                Sent response: true to [senderAddress]\n

                                                                                                                                                                                                                                                                                                                                                                In case of failure, you should see the following ERROR in the logs and the stack trace.

                                                                                                                                                                                                                                                                                                                                                                Error in removing block [blockId]\n
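As the sections below show as well, every remove* message follows the same pattern: do the removal asynchronously, then send the result (or the failure) back to the sender, logging along the way. A hedged sketch of that pattern with stand-in types; the helper name doAsync and the reply handle are assumptions for illustration:

[source, scala]
----
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// Hypothetical stand-in for the RPC reply handle, for illustration only
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
  def senderAddress: String
}

// remove* messages are handled on a separate thread (see the NOTEs in this page)
implicit val askExecutionContext: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

def doAsync[T](actionMessage: String, context: RpcCallContext)(body: => T): Unit =
  Future {
    println(s"DEBUG $actionMessage")
    body
  }.onComplete {
    case Success(response) =>
      println(s"DEBUG Done $actionMessage, response is $response")
      context.reply(response)
      println(s"DEBUG Sent response: $response to ${context.senderAddress}")
    case Failure(t) =>
      println(s"ERROR Error in $actionMessage")
      context.sendFailure(t)
  }

// handling RemoveBlock(blockId) then boils down to something like:
//   doAsync[Boolean](s"removing block $blockId", context) { /* remove the block */ true }
----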

                                                                                                                                                                                                                                                                                                                                                                === [[RemoveBroadcast]] RemoveBroadcast

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                RemoveBroadcast( broadcastId: Long, removeFromDriver: Boolean = true)

                                                                                                                                                                                                                                                                                                                                                                When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerSlaveEndpoint/#removing-broadcast-broadcastid","title":"removing broadcast [broadcastId]","text":"

                                                                                                                                                                                                                                                                                                                                                                It then calls <broadcastId broadcast>>.

                                                                                                                                                                                                                                                                                                                                                                When the computation is successful, you should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                Done removing broadcast [broadcastId], response is [response]\n

                                                                                                                                                                                                                                                                                                                                                                And the result is sent back. You should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                Sent response: [response] to [senderAddress]\n

                                                                                                                                                                                                                                                                                                                                                                In case of failure, you should see the following ERROR in the logs and the stack trace.

                                                                                                                                                                                                                                                                                                                                                                Error in removing broadcast [broadcastId]\n

                                                                                                                                                                                                                                                                                                                                                                === [[RemoveRdd]] RemoveRdd

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_4","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                RemoveRdd( rddId: Int)

                                                                                                                                                                                                                                                                                                                                                                When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

[source,plaintext]
----
removing RDD [rddId]
----

It then <> (to remove the rddId RDD).

NOTE: Handling RemoveRdd messages happens on a separate thread. See <>.

When the removal succeeds, you should see the following DEBUG message in the logs:

[source,plaintext]
----
Done removing RDD [rddId], response is [response]
----

And the number of blocks removed is sent back. You should see the following DEBUG message in the logs:

[source,plaintext]
----
Sent response: [#blocks] to [senderAddress]
----

In case of failure, you should see the following ERROR message in the logs (with the stack trace):

[source,plaintext]
----
Error in removing RDD [rddId]
----

                                                                                                                                                                                                                                                                                                                                                                === [[RemoveShuffle]] RemoveShuffle

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_5","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                RemoveShuffle( shuffleId: Int)

                                                                                                                                                                                                                                                                                                                                                                When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

removing shuffle [shuffleId]

If a scheduler:MapOutputTracker.md[MapOutputTracker] was given (when the RPC endpoint was created), BlockManagerSlaveEndpoint requests it to scheduler:MapOutputTracker.md#unregisterShuffle[unregister the shuffleId shuffle].

It then requests the shuffle:ShuffleManager.md#unregisterShuffle[ShuffleManager to unregister the shuffleId shuffle].

                                                                                                                                                                                                                                                                                                                                                                NOTE: Handling RemoveShuffle messages happens on a separate thread. See <>.

When the computation is successful, you should see the following DEBUG message in the logs:

Done removing shuffle [shuffleId], response is [response]

And the result is sent back. You should see the following DEBUG message in the logs:

Sent response: [response] to [senderAddress]

In case of a failure, you should see the following ERROR message in the logs together with the stack trace.

Error in removing shuffle [shuffleId]

                                                                                                                                                                                                                                                                                                                                                                Posted when BlockManagerMaster.md#removeShuffle[BlockManagerMaster] and storage:BlockManagerMasterEndpoint.md#removeShuffle[BlockManagerMasterEndpoint] are requested to remove all blocks of a shuffle.
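The RemoveShuffle handling above boils down to an optional MapOutputTracker call followed by a ShuffleManager call. The following is a simplified, illustrative sketch of that logic only (not the verbatim Spark source; the removeShuffle helper and its parameters are stand-ins for the endpoint's collaborators, and the actual handling runs inside the asynchronous doAsync helper described below):

[source, scala]

import org.apache.spark.MapOutputTracker
import org.apache.spark.shuffle.ShuffleManager

// Simplified sketch of the RemoveShuffle handling described above.
def removeShuffle(
    shuffleId: Int,
    mapOutputTracker: Option[MapOutputTracker],
    shuffleManager: ShuffleManager): Boolean = {
  // Unregister the shuffle from the MapOutputTracker, if one was given
  mapOutputTracker.foreach(_.unregisterShuffle(shuffleId))
  // Then request the ShuffleManager to unregister the shuffle
  shuffleManager.unregisterShuffle(shuffleId)
}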

                                                                                                                                                                                                                                                                                                                                                                === [[ReplicateBlock]] ReplicateBlock

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source-scala_6","title":"[source, scala]","text":"

ReplicateBlock(
  blockId: BlockId,
  replicas: Seq[BlockManagerId],
  maxReplicas: Int)

                                                                                                                                                                                                                                                                                                                                                                When received, BlockManagerSlaveEndpoint...FIXME

                                                                                                                                                                                                                                                                                                                                                                Posted when...FIXME

                                                                                                                                                                                                                                                                                                                                                                === [[TriggerThreadDump]] TriggerThreadDump

When received, BlockManagerSlaveEndpoint is requested for (and replies with) the thread info of all live threads, with stack traces and synchronization information.
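For illustration only, the kind of thread information described above can be collected with the JDK's standard ThreadMXBean (Spark uses its own internal utility for the actual thread dump; the snippet below is just a sketch of the idea):

[source, scala]

import java.lang.management.ManagementFactory

// Collect info for all live threads, including stack traces,
// locked monitors and locked ownable synchronizers.
val threadMXBean = ManagementFactory.getThreadMXBean
val threadInfos = threadMXBean.dumpAllThreads(
  true,  // lockedMonitors
  true)  // lockedSynchronizers
threadInfos.foreach { info =>
  println(s"${info.getThreadName} (${info.getThreadState})")
}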

                                                                                                                                                                                                                                                                                                                                                                == [[asyncThreadPool]][[asyncExecutionContext]] block-manager-slave-async-thread-pool Thread Pool

BlockManagerSlaveEndpoint creates a thread pool of up to 100 daemon threads with the block-manager-slave-async-thread-pool thread name prefix (using {java-javadoc-url}/java/util/concurrent/ThreadPoolExecutor.html[java.util.concurrent.ThreadPoolExecutor]).

BlockManagerSlaveEndpoint uses the thread pool (as a Scala implicit value) when requested to <> so that the communication is non-blocking and asynchronous.

                                                                                                                                                                                                                                                                                                                                                                The thread pool is shut down when BlockManagerSlaveEndpoint is requested to <>.

The async thread pool exists because block-related operations can take quite some time; to release the main RPC thread, separate threads are spawned to talk to the external services and pass the responses on to the clients.
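A minimal sketch of that setup follows (an illustration using plain java.util.concurrent and scala.concurrent APIs rather than Spark's internal thread-pool helpers; the asyncThreadPool and asyncExecutionContext names mirror the description above):

[source, scala]

import java.util.concurrent.{LinkedBlockingQueue, ThreadFactory, ThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext

// Threads carry the prefix described above; daemon so they never block JVM shutdown.
val threadFactory: ThreadFactory = new ThreadFactory {
  private val counter = new AtomicInteger(0)
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"block-manager-slave-async-thread-pool-${counter.incrementAndGet()}")
    t.setDaemon(true)
    t
  }
}

// Up to 100 threads; idle threads are reclaimed after 60 seconds.
val asyncThreadPool = new ThreadPoolExecutor(
  100, 100, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable](), threadFactory)
asyncThreadPool.allowCoreThreadTimeOut(true)

// Exposed as an implicit ExecutionContext so Futures (e.g. in a doAsync-like helper) run on this pool.
implicit val asyncExecutionContext: ExecutionContext =
  ExecutionContext.fromExecutorService(asyncThreadPool)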

                                                                                                                                                                                                                                                                                                                                                                == [[doAsync]] doAsync Internal Method

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#sourcescala","title":"[source,scala]","text":"

doAsync[T](
  actionMessage: String,
  context: RpcCallContext)(
  body: => T): Unit

                                                                                                                                                                                                                                                                                                                                                                doAsync creates a Scala Future to execute the following asynchronously (i.e. on a separate thread from the <>):

                                                                                                                                                                                                                                                                                                                                                                . Prints out the given actionMessage as a DEBUG message to the logs

                                                                                                                                                                                                                                                                                                                                                                . Executes the given body

                                                                                                                                                                                                                                                                                                                                                                When completed successfully, doAsync prints out the following DEBUG messages to the logs and requests the given RpcCallContext to reply the response to the sender.

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#sourceplaintext_2","title":"[source,plaintext]","text":"

Done [actionMessage], response is [response]
Sent response: [response] to [senderAddress]

                                                                                                                                                                                                                                                                                                                                                                In case of a failure, doAsync prints out the following ERROR message to the logs and requests the given RpcCallContext to send the failure to the sender.

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#sourceplaintext_3","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerSlaveEndpoint/#error-in-actionmessage","title":"Error in [actionMessage]","text":"

                                                                                                                                                                                                                                                                                                                                                                doAsync is used when BlockManagerSlaveEndpoint is requested to handle <>, <>, <> and <> messages.
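Put together, doAsync follows a straightforward Future-and-reply pattern. The sketch below approximates it (not the verbatim Spark source): logging is shown as comments, the reconstructed signature follows the description above, and the implicit ExecutionContext is assumed to be the async thread pool shown earlier.

[source, scala]

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import org.apache.spark.rpc.RpcCallContext

// Approximation of the doAsync pattern: run the body off the main RPC thread,
// then reply with the result or send the failure back to the sender.
def doAsync[T](actionMessage: String, context: RpcCallContext)(body: => T)(
    implicit ec: ExecutionContext): Unit = {
  val future = Future {
    // logDebug(actionMessage)
    body
  }
  future.onComplete {
    case Success(response) =>
      // logDebug(s"Done $actionMessage, response is $response")
      context.reply(response)
      // logDebug(s"Sent response: $response to ${context.senderAddress}")
    case Failure(t) =>
      // logError(s"Error in $actionMessage", t)
      context.sendFailure(t)
  }
}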

                                                                                                                                                                                                                                                                                                                                                                == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.storage.BlockManagerSlaveEndpoint logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSlaveEndpoint/#source","title":"[source]","text":""},{"location":"storage/BlockManagerSlaveEndpoint/#log4jloggerorgapachesparkstorageblockmanagerslaveendpointall","title":"log4j.logger.org.apache.spark.storage.BlockManagerSlaveEndpoint=ALL","text":"

                                                                                                                                                                                                                                                                                                                                                                Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerSource/","title":"BlockManagerSource -- Metrics Source for BlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSource is the spark-metrics-Source.md[metrics source] of a storage:BlockManager.md[BlockManager].

                                                                                                                                                                                                                                                                                                                                                                [[sourceName]] BlockManagerSource is registered under the name BlockManager (when SparkContext is created).

[[metrics]]
.BlockManagerSource's Gauge Metrics (in alphabetical order)
[width="100%",cols="1,1,2",options="header"]
|===
| Name | Type | Description

                                                                                                                                                                                                                                                                                                                                                                | disk.diskSpaceUsed_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their disk space used (diskUsed).

                                                                                                                                                                                                                                                                                                                                                                | memory.maxMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum memory limit (maxMem).

| memory.maxOffHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum off-heap memory (maxOffHeapMem).

| memory.maxOnHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum on-heap memory (maxOnHeapMem).

                                                                                                                                                                                                                                                                                                                                                                | memory.memUsed_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their memory used (memUsed).

                                                                                                                                                                                                                                                                                                                                                                | memory.offHeapMemUsed_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their off-heap memory used (offHeapMemUsed).

                                                                                                                                                                                                                                                                                                                                                                | memory.onHeapMemUsed_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their on-heap memory used (onHeapMemUsed).

                                                                                                                                                                                                                                                                                                                                                                | memory.remainingMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their memory remaining (memRemaining).

                                                                                                                                                                                                                                                                                                                                                                | memory.remainingOffHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their off-heap memory remaining (offHeapMemRemaining).

| memory.remainingOnHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their on-heap memory remaining (onHeapMemRemaining).
|===
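All of the gauges above follow the same pattern: ask BlockManagerMaster for the storage status of every BlockManager, sum up one field, and convert bytes to MB. A hedged sketch of that pattern for disk.diskSpaceUsed_MB follows (illustrative only; the registerDiskSpaceUsedGauge helper is hypothetical, blockManager is assumed to be the BlockManager this source was created for, and Gauge/MetricRegistry come from Dropwizard Metrics):

[source, scala]

import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.storage.BlockManager

// Sketch: register a gauge that sums diskUsed across all BlockManagers.
def registerDiskSpaceUsedGauge(metricRegistry: MetricRegistry, blockManager: BlockManager): Unit = {
  metricRegistry.register(
    MetricRegistry.name("disk", "diskSpaceUsed_MB"),
    new Gauge[Long] {
      override def getValue: Long = {
        // Ask BlockManagerMaster for every BlockManager's storage status,
        // sum the disk space used and convert bytes to MB.
        val storageStatusList = blockManager.master.getStorageStatus
        storageStatusList.map(_.diskUsed).sum / 1024 / 1024
      }
    })
}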

You can access the BlockManagerSource <> using the web UI's port (the spark-webui-properties.md#spark.ui.port[spark.ui.port] configuration property).

$ http --follow http://localhost:4040/metrics/json \
    | jq '.gauges | keys | .[] | select(test(".driver.BlockManager"; "g"))'
"local-1528725411625.driver.BlockManager.disk.diskSpaceUsed_MB"
"local-1528725411625.driver.BlockManager.memory.maxMem_MB"
"local-1528725411625.driver.BlockManager.memory.maxOffHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.maxOnHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.memUsed_MB"
"local-1528725411625.driver.BlockManager.memory.offHeapMemUsed_MB"
"local-1528725411625.driver.BlockManager.memory.onHeapMemUsed_MB"
"local-1528725411625.driver.BlockManager.memory.remainingMem_MB"
"local-1528725411625.driver.BlockManager.memory.remainingOffHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.remainingOnHeapMem_MB"

                                                                                                                                                                                                                                                                                                                                                                [[creating-instance]] [[blockManager]] BlockManagerSource takes a storage:BlockManager.md[BlockManager] when created.

                                                                                                                                                                                                                                                                                                                                                                BlockManagerSource is created when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerStorageEndpoint/","title":"BlockManagerStorageEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerStorageEndpoint is an IsolatedRpcEndpoint.

                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockManagerStorageEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                BlockManagerStorageEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                • RpcEnv
                                                                                                                                                                                                                                                                                                                                                                • BlockManager
                                                                                                                                                                                                                                                                                                                                                                • MapOutputTracker

BlockManagerStorageEndpoint is created when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockManagerStorageEndpoint/#messages","title":"Messages","text":""},{"location":"storage/BlockManagerStorageEndpoint/#decommissionblockmanager","title":"DecommissionBlockManager

                                                                                                                                                                                                                                                                                                                                                                  When received, receiveAndReply requests the BlockManager to decommissionSelf.

                                                                                                                                                                                                                                                                                                                                                                  DecommissionBlockManager is sent out when BlockManager is requested to decommissionBlockManager.

                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockReplicationPolicy/","title":"BlockReplicationPolicy","text":"

                                                                                                                                                                                                                                                                                                                                                                  BlockReplicationPolicy is...FIXME

                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockStoreClient/","title":"BlockStoreClient","text":"

                                                                                                                                                                                                                                                                                                                                                                  BlockStoreClient is an abstraction of block clients that can fetch blocks from a remote node (an executor or an external service).

                                                                                                                                                                                                                                                                                                                                                                  BlockStoreClient is a Java Closeable.

                                                                                                                                                                                                                                                                                                                                                                  Note

BlockStoreClient was previously known as ShuffleClient (SPARK-28593).

                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockStoreClient/#contract","title":"Contract","text":""},{"location":"storage/BlockStoreClient/#fetching-blocks","title":"Fetching Blocks
void fetchBlocks(
  String host,
  int port,
  String execId,
  String[] blockIds,
  BlockFetchingListener listener,
  DownloadFileManager downloadFileManager)

                                                                                                                                                                                                                                                                                                                                                                  Fetches blocks from a remote node (using DownloadFileManager)

                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockTransferService is requested to fetchBlockSync
                                                                                                                                                                                                                                                                                                                                                                  • ShuffleBlockFetcherIterator is requested to sendRequest
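For illustration, here is a hedged usage sketch of the contract above: fetching a few blocks with a simple listener that just logs outcomes. The fetchAndLog helper is hypothetical; client is assumed to be any available BlockStoreClient (for example, the BlockTransferService of the local BlockManager), and host, port and execId are assumed to identify the remote executor:

[source, scala]

import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.{BlockFetchingListener, BlockStoreClient}

// Sketch: fetch the given blocks and log each success or failure.
def fetchAndLog(
    client: BlockStoreClient,
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String]): Unit = {
  val listener = new BlockFetchingListener {
    override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit =
      println(s"Fetched $blockId (${data.size} bytes)")
    override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit =
      println(s"Failed to fetch $blockId: $exception")
  }
  // A null DownloadFileManager keeps fetched blocks in memory rather than streaming them to disk.
  client.fetchBlocks(host, port, execId, blockIds, listener, null)
}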
                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockStoreClient/#shuffle-metrics","title":"Shuffle Metrics
MetricSet shuffleMetrics()

Shuffle MetricSet

                                                                                                                                                                                                                                                                                                                                                                  Default: (empty)

                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested for the Shuffle Metrics Source
                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockStoreClient/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                  • BlockTransferService
                                                                                                                                                                                                                                                                                                                                                                  • ExternalBlockStoreClient
                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockStoreUpdater/","title":"BlockStoreUpdater","text":"

                                                                                                                                                                                                                                                                                                                                                                  BlockStoreUpdater is an abstraction of block store updaters that store blocks (from bytes, whether they start in memory or on disk).

                                                                                                                                                                                                                                                                                                                                                                  BlockStoreUpdater is an internal class of BlockManager.

                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockStoreUpdater/#contract","title":"Contract","text":""},{"location":"storage/BlockStoreUpdater/#block-data","title":"Block Data
blockData(): BlockData

                                                                                                                                                                                                                                                                                                                                                                  BlockData

                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                  • TempFileBasedBlockStoreUpdater is requested to readToByteBuffer
                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockStoreUpdater/#readtobytebuffer","title":"readToByteBuffer
readToByteBuffer(): ChunkedByteBuffer

                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
saveToDiskStore(): Unit

                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                  • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockStoreUpdater/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                  • ByteBufferBlockStoreUpdater
                                                                                                                                                                                                                                                                                                                                                                  • TempFileBasedBlockStoreUpdater
                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/BlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                  BlockStoreUpdater takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                  • Block Size
                                                                                                                                                                                                                                                                                                                                                                  • BlockId
                                                                                                                                                                                                                                                                                                                                                                  • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                  • Scala's ClassTag
                                                                                                                                                                                                                                                                                                                                                                  • tellMaster flag
• keepReadLock flag

== Abstract Class

BlockStoreUpdater is an abstract class and cannot be created directly. It is created indirectly for the concrete BlockStoreUpdaters.

                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockStoreUpdater/#saving-block-to-block-store","title":"Saving Block to Block Store
save(): Boolean

save calls doPut with the putBody function.

save is used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to putBlockDataAsStream and store block bytes locally
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockStoreUpdater/#putbody-function","title":"putBody Function

With a StorageLevel with replication (above 1), the putBody function triggers replication concurrently, using a Scala Future on a separate thread (from the ExecutionContextExecutorService).
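The following is a minimal sketch (not Spark's actual code; storeLocally and replicate are hypothetical stand-ins) of how replication can be started on a separate thread while the local write proceeds:

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Single-threaded executor standing in for the ExecutionContextExecutorService
implicit val replicationContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

def putWithReplication(replication: Int)(storeLocally: => Boolean)(replicate: => Unit): Boolean = {
  // Kick off replication eagerly, but only when more than one copy is requested
  val replicationFuture = if (replication > 1) Some(Future(replicate)) else None
  val storedLocally = storeLocally
  // Wait for the concurrent replication to finish before returning
  replicationFuture.foreach(f => Await.ready(f, Duration.Inf))
  storedLocally
}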

                                                                                                                                                                                                                                                                                                                                                                    In general, putBody stores the block in the MemoryStore first (if requested based on useMemory of the StorageLevel). putBody saves to a DiskStore (if useMemory is not specified or storing to the MemoryStore failed).

                                                                                                                                                                                                                                                                                                                                                                    Note

putBody stores the block in the MemoryStore only, even if the useMemory and useDisk flags are both turned on (true).

                                                                                                                                                                                                                                                                                                                                                                    Spark drops the block to disk later if the memory store can't hold it.

With useMemory of the StorageLevel set, putBody calls saveDeserializedValuesToMemoryStore for a deserialized storage level or saveSerializedValuesToMemoryStore otherwise.

putBody saves to a DiskStore when either of the following happens (see the sketch below):

1. Storing in memory fails and the useDisk flag (of the StorageLevel) is set
2. useMemory of the StorageLevel is not set but useDisk is
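A simplified sketch of that decision flow (the flag and helper names mirror the text above; this is illustrative, not the exact putBody code):

// Try the MemoryStore first, then fall back to (or go straight to) the DiskStore.
def putBodySketch(
    useMemory: Boolean,
    useDisk: Boolean,
    deserialized: Boolean)(
    saveDeserializedValuesToMemoryStore: () => Boolean,
    saveSerializedValuesToMemoryStore: () => Boolean,
    saveToDiskStore: () => Unit): Unit = {
  val storedInMemory =
    useMemory && {
      if (deserialized) saveDeserializedValuesToMemoryStore()
      else saveSerializedValuesToMemoryStore()
    }
  // Persist to disk only when memory was skipped or could not hold the block
  if (!storedInMemory && useDisk) saveToDiskStore()
}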

putBody calls getCurrentBlockStatus and checks whether the block is in either the memory or disk store.

In the end, putBody calls reportBlockStatus (if the given tellMaster flag and the tellMaster flag of the BlockInfo are both enabled) and addUpdatedBlockStatusToTaskMetrics.

                                                                                                                                                                                                                                                                                                                                                                    putBody prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                    Put block [blockId] locally took [timeUsed] ms\n

                                                                                                                                                                                                                                                                                                                                                                    putBody prints out the following WARN message to the logs when an attempt to store a block in memory fails and the useDisk is set:

                                                                                                                                                                                                                                                                                                                                                                    Persisting block [blockId] to disk instead.\n
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockStoreUpdater/#saving-deserialized-values-to-memorystore","title":"Saving Deserialized Values to MemoryStore
                                                                                                                                                                                                                                                                                                                                                                    saveDeserializedValuesToMemoryStore(\n  inputStream: InputStream): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                    saveDeserializedValuesToMemoryStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                    saveDeserializedValuesToMemoryStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockStoreUpdater is requested to save a block (with memory deserialized storage level)
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockStoreUpdater/#saving-serialized-values-to-memorystore","title":"Saving Serialized Values to MemoryStore
                                                                                                                                                                                                                                                                                                                                                                    saveSerializedValuesToMemoryStore(\n  bytes: ChunkedByteBuffer): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                    saveSerializedValuesToMemoryStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                    saveSerializedValuesToMemoryStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockStoreUpdater is requested to save a block (with memory serialized storage level)
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockStoreUpdater/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                    BlockStoreUpdater is an abstract class and logging is configured using the logger of the implementations.

                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockTransferService/","title":"BlockTransferService","text":"

                                                                                                                                                                                                                                                                                                                                                                    BlockTransferService is an extension of the BlockStoreClient abstraction for shuffle clients that can fetch and upload blocks of data (synchronously or asynchronously).

                                                                                                                                                                                                                                                                                                                                                                    BlockTransferService is a network service available by a host name and a port.

                                                                                                                                                                                                                                                                                                                                                                    BlockTransferService was introduced in SPARK-3019 Pluggable block transfer interface (BlockTransferService).

                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockTransferService/#contract","title":"Contract","text":""},{"location":"storage/BlockTransferService/#host-name","title":"Host Name
                                                                                                                                                                                                                                                                                                                                                                    hostName: String\n

                                                                                                                                                                                                                                                                                                                                                                    Host name this service is listening on

                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockTransferService/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                    init(\n  blockDataManager: BlockDataManager): Unit\n

                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockTransferService/#port","title":"Port
                                                                                                                                                                                                                                                                                                                                                                    port: Int\n

                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockTransferService/#uploading-block-asynchronously","title":"Uploading Block Asynchronously
                                                                                                                                                                                                                                                                                                                                                                    uploadBlock(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Future[Unit]\n

                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockTransferService is requested to uploadBlockSync
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockTransferService/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                    • NettyBlockTransferService
                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockTransferService/#uploading-block-synchronously","title":"Uploading Block Synchronously
                                                                                                                                                                                                                                                                                                                                                                    uploadBlockSync(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Unit\n

uploadBlockSync calls uploadBlock and waits until it finishes.
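Conceptually (and using plain scala.concurrent APIs rather than Spark's internal thread utilities), the synchronous variant simply blocks on the Future returned by uploadBlock:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

// Block the calling thread until the asynchronous upload completes
// (or rethrow its failure); `upload` stands for a call to uploadBlock.
def awaitUpload(upload: => Future[Unit]): Unit =
  Await.result(upload, Duration.Inf)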

                                                                                                                                                                                                                                                                                                                                                                    uploadBlockSync\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to replicate
                                                                                                                                                                                                                                                                                                                                                                    • ShuffleMigrationRunnable is requested to run
                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/","title":"ByteBufferBlockStoreUpdater","text":"

                                                                                                                                                                                                                                                                                                                                                                    ByteBufferBlockStoreUpdater is a BlockStoreUpdater (that BlockManager uses for storing a block from bytes already in memory).

                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/ByteBufferBlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                    ByteBufferBlockStoreUpdater takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                    • BlockId
                                                                                                                                                                                                                                                                                                                                                                    • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                    • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                                    • ChunkedByteBuffer
                                                                                                                                                                                                                                                                                                                                                                    • tellMaster flag (default: true)
                                                                                                                                                                                                                                                                                                                                                                    • keepReadLock flag (default: false)

                                                                                                                                                                                                                                                                                                                                                                      ByteBufferBlockStoreUpdater is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to store a block (bytes) locally
                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/ByteBufferBlockStoreUpdater/#block-data","title":"Block Data
                                                                                                                                                                                                                                                                                                                                                                      blockData(): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                      blockData creates a ByteBufferBlockData (with the ChunkedByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                      blockData\u00a0is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/#readtobytebuffer","title":"readToByteBuffer
                                                                                                                                                                                                                                                                                                                                                                      readToByteBuffer(): ChunkedByteBuffer\n

                                                                                                                                                                                                                                                                                                                                                                      readToByteBuffer simply gives the ChunkedByteBuffer (it was created with).

                                                                                                                                                                                                                                                                                                                                                                      readToByteBuffer\u00a0is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
                                                                                                                                                                                                                                                                                                                                                                      saveToDiskStore(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                      saveToDiskStore requests the DiskStore (of the parent BlockManager) to putBytes (with the BlockId and the ChunkedByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                      saveToDiskStore\u00a0is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockManager/","title":"DiskBlockManager","text":"

DiskBlockManager manages a mapping of logical blocks to their physical on-disk locations for a BlockManager.

By default, one block is mapped to one file whose name is given by its BlockId. It is, however, possible for a block to be mapped to only a segment of a file.
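For example (assuming spark-core on the classpath), the file name of an RDD block is simply the BlockId's name:

import org.apache.spark.storage.RDDBlockId

// Partition 3 of RDD 42 is stored in a file called "rdd_42_3"
// under one of the blockmgr-* directories.
RDDBlockId(42, 3).name  // "rdd_42_3"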

                                                                                                                                                                                                                                                                                                                                                                      Block files are hashed among the local directories.

                                                                                                                                                                                                                                                                                                                                                                      DiskBlockManager is used to create a DiskStore.

                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/DiskBlockManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                      DiskBlockManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                      • deleteFilesOnStop flag

                                                                                                                                                                                                                                                                                                                                                                        When created, DiskBlockManager creates the local directories for block storage and initializes the internal subDirs collection of locks for every local directory.

DiskBlockManager then calls createLocalDirsForMergedShuffleBlocks.

                                                                                                                                                                                                                                                                                                                                                                        In the end, DiskBlockManager registers a shutdown hook to clean up the local directories for blocks.

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockManager is created for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/DiskBlockManager/#createlocaldirsformergedshuffleblocks","title":"createLocalDirsForMergedShuffleBlocks
                                                                                                                                                                                                                                                                                                                                                                        createLocalDirsForMergedShuffleBlocks(): Unit\n

createLocalDirsForMergedShuffleBlocks is a noop when isPushBasedShuffleEnabled is disabled (push-based shuffle is supported in YARN mode only).

                                                                                                                                                                                                                                                                                                                                                                        createLocalDirsForMergedShuffleBlocks...FIXME

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#accessing-diskblockmanager","title":"Accessing DiskBlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockManager is available using SparkEnv.

                                                                                                                                                                                                                                                                                                                                                                        org.apache.spark.SparkEnv.get.blockManager.diskBlockManager\n
                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/DiskBlockManager/#local-directories-for-block-storage","title":"Local Directories for Block Storage

DiskBlockManager creates a blockmgr directory in every local root directory when created.

DiskBlockManager uses the localDirs internal registry of all the blockmgr directories.

DiskBlockManager expects at least one local directory; otherwise, it prints out the following ERROR message to the logs and exits the JVM (with exit code 53):

                                                                                                                                                                                                                                                                                                                                                                        Failed to create any local dir.\n

                                                                                                                                                                                                                                                                                                                                                                        localDirs is used when:

                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is created (and creates localDirsString and subDirs), requested to look up a file (among local subdirectories) and doStop
                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to register with an external shuffle server
                                                                                                                                                                                                                                                                                                                                                                        • BasePythonRunner (PySpark) is requested to compute
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#localdirsstring","title":"localDirsString

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockManager uses localDirsString internal registry of the paths of the local blockmgr directories.

                                                                                                                                                                                                                                                                                                                                                                        localDirsString is used by BlockManager when requested for getLocalDiskDirs.

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#creating-blockmgr-directory-in-every-local-root-directory","title":"Creating blockmgr Directory in Every Local Root Directory
                                                                                                                                                                                                                                                                                                                                                                        createLocalDirs(\n  conf: SparkConf): Array[File]\n

                                                                                                                                                                                                                                                                                                                                                                        createLocalDirs creates blockmgr local directories for storing block data.

                                                                                                                                                                                                                                                                                                                                                                        createLocalDirs creates a blockmgr-[randomUUID] directory under every root directory for local storage and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                        Created local directory at [localDir]\n

In case of an exception, createLocalDirs prints out the following ERROR message to the logs and ignores the directory:

                                                                                                                                                                                                                                                                                                                                                                        Failed to create local dir in [rootDir]. Ignoring this directory.\n
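A hedged sketch of the whole procedure (rootDirs stands for the configured local root directories, e.g. spark.local.dir; the println calls stand in for the INFO and ERROR log messages above):

import java.io.File
import java.util.UUID

// Create a blockmgr-<randomUUID> directory under every root directory,
// logging and skipping the roots where creation fails.
def createLocalDirsSketch(rootDirs: Seq[String]): Array[File] =
  rootDirs.flatMap { rootDir =>
    try {
      val localDir = new File(rootDir, s"blockmgr-${UUID.randomUUID}")
      if (localDir.mkdirs() || localDir.isDirectory) {
        println(s"Created local directory at $localDir")
        Some(localDir)
      } else {
        println(s"Failed to create local dir in $rootDir. Ignoring this directory.")
        None
      }
    } catch {
      case _: Exception =>
        println(s"Failed to create local dir in $rootDir. Ignoring this directory.")
        None
    }
  }.toArray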
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#file-locks-for-local-block-store-directories","title":"File Locks for Local Block Store Directories
                                                                                                                                                                                                                                                                                                                                                                        subDirs: Array[Array[File]]\n

                                                                                                                                                                                                                                                                                                                                                                        subDirs is a lookup table for file locks of every local block directory (with the first dimension for local directories and the second for locks).

                                                                                                                                                                                                                                                                                                                                                                        The number of block subdirectories is controlled by spark.diskStore.subDirectories configuration property.

                                                                                                                                                                                                                                                                                                                                                                        subDirs(dirId)(subDirId) is used to access subDirId subdirectory in dirId local directory.
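An illustrative layout of that lookup table (the default of 64 subdirectories is an assumption based on the usual default of spark.diskStore.subDirectories; this is not the exact Spark code):

import java.io.File

val localDirs: Array[File] =
  Array(new File("/tmp/blockmgr-0"), new File("/tmp/blockmgr-1"))
val subDirsPerLocalDir = 64  // assumed default of spark.diskStore.subDirectories

// One inner array per blockmgr directory; entries start out empty (null) and the
// inner array is what gets synchronized on when a subdirectory has to be created.
val subDirs: Array[Array[File]] =
  Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

// Accessing the subDirId-th subdirectory of the dirId-th local directory:
// subDirs(dirId).synchronized { Option(subDirs(dirId)(subDirId)) }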

                                                                                                                                                                                                                                                                                                                                                                        subDirs is used when:

                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is requested for a block file and all the block files
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#finding-block-file-and-creating-parent-directories","title":"Finding Block File (and Creating Parent Directories)
                                                                                                                                                                                                                                                                                                                                                                        getFile(\n  blockId: BlockId): File\ngetFile(\n  filename: String): File\n

getFile computes a hash of the file name of the input BlockId that is used to choose the parent directory and the subdirectory.

                                                                                                                                                                                                                                                                                                                                                                        getFile creates the subdirectory unless it already exists.
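A simplified sketch of the lookup (a non-negative hash of the file name selects the local directory, and the quotient selects the subdirectory; the hashing shown here is a stand-in, not Spark's exact implementation):

import java.io.File

// Map a block file name to its (lazily created) subdirectory and file.
def getFileSketch(
    filename: String,
    localDirs: Array[File],
    subDirsPerLocalDir: Int): File = {
  val hash = math.abs(filename.hashCode % Int.MaxValue)  // simplified non-negative hash
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
  val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
  if (!subDir.exists()) subDir.mkdirs()  // create the subdirectory unless it already exists
  new File(subDir, filename)
}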

                                                                                                                                                                                                                                                                                                                                                                        getFile is used when:

                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is requested to containsBlock, createTempLocalBlock, createTempShuffleBlock

                                                                                                                                                                                                                                                                                                                                                                        • DiskStore is requested to getBytes, remove, contains, and put

                                                                                                                                                                                                                                                                                                                                                                        • IndexShuffleBlockResolver is requested to getDataFile and getIndexFile

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#createtempshuffleblock","title":"createTempShuffleBlock
                                                                                                                                                                                                                                                                                                                                                                        createTempShuffleBlock(): (TempShuffleBlockId, File)\n

createTempShuffleBlock creates a temporary block and returns its TempShuffleBlockId together with the corresponding file.

                                                                                                                                                                                                                                                                                                                                                                        createTempShuffleBlock...FIXME
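
The flow can be sketched as follows (a sketch based on the method's contract, assuming getFile from this page; not a verbatim copy of the implementation):

import java.util.UUID
import org.apache.spark.storage.TempShuffleBlockId

// Keep generating random TempShuffleBlockIds until one maps to a file that does not exist yet.
var blockId = TempShuffleBlockId(UUID.randomUUID())
while (getFile(blockId).exists()) {
  blockId = TempShuffleBlockId(UUID.randomUUID())
}
(blockId, getFile(blockId))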

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#registering-shutdown-hook","title":"Registering Shutdown Hook
                                                                                                                                                                                                                                                                                                                                                                        addShutdownHook(): AnyRef\n

                                                                                                                                                                                                                                                                                                                                                                        addShutdownHook registers a shutdown hook to execute doStop at shutdown.

                                                                                                                                                                                                                                                                                                                                                                        When executed, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                        Adding shutdown hook\n

The registered shutdown hook prints the following INFO message and executes doStop when triggered:

                                                                                                                                                                                                                                                                                                                                                                        Shutdown hook called\n
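
A minimal sketch of the registration, assuming Spark's ShutdownHookManager utility (the exact shutdown priority is an implementation detail):

import org.apache.spark.util.ShutdownHookManager

logDebug("Adding shutdown hook")
ShutdownHookManager.addShutdownHook(ShutdownHookManager.TEMP_DIR_SHUTDOWN_PRIORITY + 1) { () =>
  logInfo("Shutdown hook called")
  doStop()
}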
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#getting-writable-directories-in-yarn","title":"Getting Writable Directories in YARN
                                                                                                                                                                                                                                                                                                                                                                        getYarnLocalDirs(\n  conf: SparkConf): String\n

getYarnLocalDirs uses the given SparkConf to read the LOCAL_DIRS environment variable with comma-separated local directories (that have already been created and secured so that only the user has access to them).

getYarnLocalDirs throws an Exception when the LOCAL_DIRS environment variable is not set:

                                                                                                                                                                                                                                                                                                                                                                        Yarn Local dirs can't be empty\n
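
A minimal sketch of this behaviour (assuming SparkConf.getenv for reading the environment variable):

// Read the comma-separated YARN-provided local directories or fail fast.
val localDirs = Option(conf.getenv("LOCAL_DIRS")).getOrElse("")
if (localDirs.isEmpty) {
  throw new Exception("Yarn Local dirs can't be empty")
}
localDirs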
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#checking-whether-spark-runs-on-yarn","title":"Checking Whether Spark Runs on YARN
                                                                                                                                                                                                                                                                                                                                                                        isRunningInYarnContainer(\n  conf: SparkConf): Boolean\n

isRunningInYarnContainer uses the given SparkConf to read Hadoop YARN's CONTAINER_ID environment variable (exported by the YARN NodeManager) to find out whether Spark runs in a YARN container.
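
In other words (a one-line sketch, assuming SparkConf.getenv):

// A non-null CONTAINER_ID (exported by the YARN NodeManager) indicates a YARN container.
conf.getenv("CONTAINER_ID") != null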

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#getting-all-blocks-from-files-stored-on-disk","title":"Getting All Blocks (From Files Stored On Disk)
                                                                                                                                                                                                                                                                                                                                                                        getAllBlocks(): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                                        getAllBlocks gets all the blocks stored on disk.

                                                                                                                                                                                                                                                                                                                                                                        Internally, getAllBlocks takes the block files and returns their names (as BlockId).
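
A simplified sketch of the conversion (assumption: files whose names cannot be parsed into a BlockId are simply skipped):

import scala.util.Try
import org.apache.spark.storage.BlockId

getAllFiles().flatMap { file =>
  // BlockId(name) parses a block file name back into a BlockId.
  Try(BlockId(file.getName)).toOption
}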

                                                                                                                                                                                                                                                                                                                                                                        getAllBlocks is used when BlockManager is requested to find IDs of existing blocks for a given filter.

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#all-block-files","title":"All Block Files
                                                                                                                                                                                                                                                                                                                                                                        getAllFiles(): Seq[File]\n

                                                                                                                                                                                                                                                                                                                                                                        getAllFiles uses the subDirs registry to list all the files (in all the directories) that are currently stored on disk by this disk manager.
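
Conceptually (a simplified sketch; the per-directory synchronization and defensive copying are omitted):

// Flatten the two-level subDirs registry and list the files of every existing subdirectory.
subDirs.flatten.filter(_ != null).flatMap { dir =>
  Option(dir.listFiles()).toSeq.flatten
}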

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#stopping","title":"Stopping
                                                                                                                                                                                                                                                                                                                                                                        stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                        stop...FIXME

                                                                                                                                                                                                                                                                                                                                                                        stop is used when BlockManager is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#stopping-diskblockmanager-removing-local-directories-for-blocks","title":"Stopping DiskBlockManager (Removing Local Directories for Blocks)
                                                                                                                                                                                                                                                                                                                                                                        doStop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                        doStop deletes the local directories recursively (only when the constructor's deleteFilesOnStop is enabled and the parent directories are not registered to be removed at shutdown).
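
A minimal sketch of the cleanup (assuming Utils.deleteRecursively as the removal helper; the check for directories already registered to be removed at shutdown is left out):

import org.apache.spark.util.Utils

if (deleteFilesOnStop) {
  localDirs.filter(dir => dir.isDirectory && dir.exists()).foreach { dir =>
    Utils.deleteRecursively(dir)
  }
}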

                                                                                                                                                                                                                                                                                                                                                                        doStop is used when:

                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is requested to shut down or stop
                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#demo","title":"Demo

                                                                                                                                                                                                                                                                                                                                                                        Demo: DiskBlockManager and Block Data

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.storage.DiskBlockManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.storage.DiskBlockManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskBlockObjectWriter/","title":"DiskBlockObjectWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockObjectWriter is a disk writer of BlockManager.

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockObjectWriter is an OutputStream (Java) that BlockManager offers for writing data blocks to disk.

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockObjectWriter is used when:

                                                                                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter is requested for partition writers

                                                                                                                                                                                                                                                                                                                                                                        • UnsafeSorterSpillWriter is requested for a partition writer

                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is requested to writeSortedFile

                                                                                                                                                                                                                                                                                                                                                                        • ExternalSorter is requested to spillMemoryIteratorToDisk

                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/DiskBlockObjectWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                        DiskBlockObjectWriter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                        • File (Java)
                                                                                                                                                                                                                                                                                                                                                                        • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                        • SerializerInstance
                                                                                                                                                                                                                                                                                                                                                                        • Buffer size
                                                                                                                                                                                                                                                                                                                                                                        • syncWrites flag (based on spark.shuffle.sync configuration property)
                                                                                                                                                                                                                                                                                                                                                                        • ShuffleWriteMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                        • BlockId (default: null)

                                                                                                                                                                                                                                                                                                                                                                          DiskBlockObjectWriter is created when:

                                                                                                                                                                                                                                                                                                                                                                          • BlockManager is requested for a disk writer
                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/DiskBlockObjectWriter/#buffer-size","title":"Buffer Size

                                                                                                                                                                                                                                                                                                                                                                          DiskBlockObjectWriter is given a buffer size when created.

The buffer size is specified by BlockManager and is based on the spark.shuffle.file.buffer configuration property in most cases, or is hardcoded to 32k in a few cases (which is in fact the property's default value).

                                                                                                                                                                                                                                                                                                                                                                          The buffer size is exactly the buffer size of the BufferedOutputStream.
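
For example, the buffer size could be raised explicitly at configuration time (a hypothetical example; 64k is just an illustrative value):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.shuffle.file.buffer", "64k")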

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#serializationstream","title":"SerializationStream

                                                                                                                                                                                                                                                                                                                                                                          DiskBlockObjectWriter manages a SerializationStream for writing a key-value record:

                                                                                                                                                                                                                                                                                                                                                                          • Opens it when requested to open

                                                                                                                                                                                                                                                                                                                                                                          • Closes it when requested to commitAndGet

                                                                                                                                                                                                                                                                                                                                                                          • Dereferences it (nulls it) when closeResources

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#states","title":"States

                                                                                                                                                                                                                                                                                                                                                                          DiskBlockObjectWriter can be in one of the following states (that match the state of the underlying output streams):

                                                                                                                                                                                                                                                                                                                                                                          • Initialized
                                                                                                                                                                                                                                                                                                                                                                          • Open
                                                                                                                                                                                                                                                                                                                                                                          • Closed
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#writing-out-record","title":"Writing Out Record
                                                                                                                                                                                                                                                                                                                                                                          write(\n  key: Any,\n  value: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                                          write opens the underlying stream unless open already.

                                                                                                                                                                                                                                                                                                                                                                          write requests the SerializationStream to write the key and then the value.

                                                                                                                                                                                                                                                                                                                                                                          In the end, write updates the write metrics.
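
The flow can be sketched as follows (a simplified sketch of the steps above; recordWritten stands for the internal metrics update):

if (!streamOpen) {
  open()
}
// The SerializationStream writes the key and then the value.
objOut.writeKey(key)
objOut.writeValue(value)
// Update the write metrics for the record just written.
recordWritten()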

                                                                                                                                                                                                                                                                                                                                                                          write is used when:

                                                                                                                                                                                                                                                                                                                                                                          • BypassMergeSortShuffleWriter is requested to write records of a partition

                                                                                                                                                                                                                                                                                                                                                                          • ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk

                                                                                                                                                                                                                                                                                                                                                                          • ExternalSorter is requested to write all records into a partitioned file

                                                                                                                                                                                                                                                                                                                                                                            • SpillableIterator is requested to spill
                                                                                                                                                                                                                                                                                                                                                                          • WritablePartitionedPairCollection is requested for a destructiveSortedWritablePartitionedIterator

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#commitandget","title":"commitAndGet
                                                                                                                                                                                                                                                                                                                                                                          commitAndGet(): FileSegment\n

                                                                                                                                                                                                                                                                                                                                                                          With streamOpen enabled, commitAndGet...FIXME

                                                                                                                                                                                                                                                                                                                                                                          Otherwise, commitAndGet returns a new FileSegment (with the File, committedPosition and 0 length).

                                                                                                                                                                                                                                                                                                                                                                          commitAndGet is used when:

                                                                                                                                                                                                                                                                                                                                                                          • BypassMergeSortShuffleWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                          • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                          • DiskBlockObjectWriter is requested to close
                                                                                                                                                                                                                                                                                                                                                                          • ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk
                                                                                                                                                                                                                                                                                                                                                                          • ExternalSorter is requested to spillMemoryIteratorToDisk, writePartitionedFile
                                                                                                                                                                                                                                                                                                                                                                          • UnsafeSorterSpillWriter is requested to close
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#committing-writes-and-closing-resources","title":"Committing Writes and Closing Resources
                                                                                                                                                                                                                                                                                                                                                                          close(): Unit\n

Only when initialized, close executes commitAndGet followed by closeResources. Otherwise, close does nothing.
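
A minimal sketch of the flow (assuming the resources are released even when the commit fails):

if (initialized) {
  try {
    commitAndGet()
  } finally {
    closeResources()
  }
}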

                                                                                                                                                                                                                                                                                                                                                                          close is used when:

                                                                                                                                                                                                                                                                                                                                                                          • FIXME
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#revertpartialwritesandclose","title":"revertPartialWritesAndClose
                                                                                                                                                                                                                                                                                                                                                                          revertPartialWritesAndClose(): File\n

                                                                                                                                                                                                                                                                                                                                                                          revertPartialWritesAndClose...FIXME

                                                                                                                                                                                                                                                                                                                                                                          revertPartialWritesAndClose is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#writing-bytes-from-byte-array-starting-from-offset","title":"Writing Bytes (From Byte Array Starting From Offset)
                                                                                                                                                                                                                                                                                                                                                                          write(\n  kvBytes: Array[Byte],\n  offs: Int,\n  len: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                          write...FIXME

                                                                                                                                                                                                                                                                                                                                                                          write is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#opening-diskblockobjectwriter","title":"Opening DiskBlockObjectWriter
                                                                                                                                                                                                                                                                                                                                                                          open(): DiskBlockObjectWriter\n

open opens the DiskBlockObjectWriter, i.e. initializes and resets the internal bs and objOut output streams.

Internally, open makes sure that DiskBlockObjectWriter is not closed (the hasBeenClosed flag is disabled). If it was closed, open throws an IllegalStateException:

                                                                                                                                                                                                                                                                                                                                                                          Writer already closed. Cannot be reopened.\n

                                                                                                                                                                                                                                                                                                                                                                          Unless DiskBlockObjectWriter has already been initialized (initialized flag is enabled), open initializes it (and turns initialized flag on).

Regardless of whether DiskBlockObjectWriter was already initialized or not, open requests the SerializerManager to wrap the mcs output stream for encryption and compression (for the blockId) and sets it as bs.

open then requests the SerializerInstance to create a serialization stream over the bs output stream and sets it as objOut.

                                                                                                                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                                                                                                                          open uses the SerializerInstance that was used to create the DiskBlockObjectWriter.

                                                                                                                                                                                                                                                                                                                                                                          In the end, open turns streamOpen flag on.
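
Putting the steps together (a simplified sketch; bs, objOut, mcs, hasBeenClosed, initialized and streamOpen are the internal fields mentioned on this page):

if (hasBeenClosed) {
  throw new IllegalStateException("Writer already closed. Cannot be reopened.")
}
if (!initialized) {
  initialize()
  initialized = true
}
// Wrap the buffered output stream for encryption and compression...
bs = serializerManager.wrapStream(blockId, mcs)
// ...and open a serialization stream over it.
objOut = serializerInstance.serializeStream(bs)
streamOpen = true
this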

                                                                                                                                                                                                                                                                                                                                                                          open is used when DiskBlockObjectWriter writes out a record or bytes from a specified byte array and the stream is not open yet.

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#initialization","title":"Initialization
                                                                                                                                                                                                                                                                                                                                                                          initialize(): Unit\n

initialize creates a FileOutputStream to write to the file (with the append flag enabled) and takes the FileChannel associated with this file output stream.

                                                                                                                                                                                                                                                                                                                                                                          initialize creates a TimeTrackingOutputStream (with the ShuffleWriteMetricsReporter and the FileOutputStream).

                                                                                                                                                                                                                                                                                                                                                                          With checksumEnabled, initialize...FIXME

                                                                                                                                                                                                                                                                                                                                                                          In the end, initialize creates a BufferedOutputStream.
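
The stream stack that initialize builds can be sketched as follows (a simplified, self-contained example; the TimeTrackingOutputStream and the checksum wrapping are omitted, and the buffer size is illustrative only):

import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.channels.FileChannel

// Simplified sketch of the streams built by initialize.
val file = File.createTempFile("block", ".data")   // stands in for the block file
val fos = new FileOutputStream(file, true)         // append = true
val channel: FileChannel = fos.getChannel()        // used later by updateBytesWritten
val bufferSize = 32 * 1024                         // illustrative buffer size
val mcs = new BufferedOutputStream(fos, bufferSize)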

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#checksumenabled-flag","title":"checksumEnabled Flag

DiskBlockObjectWriter defines the checksumEnabled flag to...FIXME

                                                                                                                                                                                                                                                                                                                                                                          checksumEnabled is false by default and can be enabled using setChecksum.

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#setchecksum","title":"setChecksum
                                                                                                                                                                                                                                                                                                                                                                          setChecksum(\n  checksum: Checksum): Unit\n

                                                                                                                                                                                                                                                                                                                                                                          setChecksum...FIXME

                                                                                                                                                                                                                                                                                                                                                                          setChecksum is used when:

                                                                                                                                                                                                                                                                                                                                                                          • BypassMergeSortShuffleWriter is requested to write records (with spark.shuffle.checksum.enabled enabled)
                                                                                                                                                                                                                                                                                                                                                                          • ShuffleExternalSorter is requested to writeSortedFile (with spark.shuffle.checksum.enabled enabled)
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#recording-bytes-written","title":"Recording Bytes Written
                                                                                                                                                                                                                                                                                                                                                                          recordWritten(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                          recordWritten increases the numRecordsWritten counter.

                                                                                                                                                                                                                                                                                                                                                                          recordWritten requests the ShuffleWriteMetricsReporter to incRecordsWritten.

recordWritten updates the bytes written metric every 16384 records written (based on the numRecordsWritten counter).
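
Put differently, the (relatively expensive) file-position lookup happens at most once per 16384 records. A minimal sketch of that logic (simplified; numRecordsWritten, writeMetrics and updateBytesWritten stand for the members described above):

// Simplified sketch of recordWritten.
def recordWritten(): Unit = {
  numRecordsWritten += 1
  writeMetrics.incRecordsWritten(1)
  if (numRecordsWritten % 16384 == 0) {
    updateBytesWritten()   // refresh the bytes-written metric periodically
  }
}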

                                                                                                                                                                                                                                                                                                                                                                          recordWritten is used when:

                                                                                                                                                                                                                                                                                                                                                                          • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                          • DiskBlockObjectWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                          • UnsafeSorterSpillWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#updating-bytes-written-metric","title":"Updating Bytes Written Metric
                                                                                                                                                                                                                                                                                                                                                                          updateBytesWritten(): Unit\n

updateBytesWritten requests the FileChannel for the current file position (i.e., the number of bytes from the beginning of the file) and requests the ShuffleWriteMetricsReporter to incBytesWritten by the difference between that position and the reportedPosition counter.

                                                                                                                                                                                                                                                                                                                                                                          In the end, updateBytesWritten updates the reportedPosition counter to the current file position (so it can report incBytesWritten properly).
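
A minimal sketch of that bookkeeping (simplified; channel, writeMetrics and reportedPosition stand for the members described above):

// Simplified sketch of updateBytesWritten.
def updateBytesWritten(): Unit = {
  val pos = channel.position()                     // bytes from the beginning of the file
  writeMetrics.incBytesWritten(pos - reportedPosition)
  reportedPosition = pos                           // remember what has been reported so far
}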

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#bufferedoutputstream","title":"BufferedOutputStream
                                                                                                                                                                                                                                                                                                                                                                          mcs: ManualCloseOutputStream\n

                                                                                                                                                                                                                                                                                                                                                                          DiskBlockObjectWriter creates a custom BufferedOutputStream (Java) when requested to initialize.

                                                                                                                                                                                                                                                                                                                                                                          The BufferedOutputStream is closed (and dereferenced) in closeResources.

                                                                                                                                                                                                                                                                                                                                                                          The BufferedOutputStream is used to create the OutputStream when requested to open.

                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskBlockObjectWriter/#outputstream","title":"OutputStream
                                                                                                                                                                                                                                                                                                                                                                          bs: OutputStream\n

DiskBlockObjectWriter creates an OutputStream when requested to open. The OutputStream is wrapped with encryption and compression when enabled (see the sketch at the end of this section).

                                                                                                                                                                                                                                                                                                                                                                          The OutputStream is closed (and dereferenced) in closeResources.

                                                                                                                                                                                                                                                                                                                                                                          The OutputStream is used to create the SerializationStream when requested to open.

                                                                                                                                                                                                                                                                                                                                                                          The OutputStream is requested for the following:

                                                                                                                                                                                                                                                                                                                                                                          • Write bytes out in write
                                                                                                                                                                                                                                                                                                                                                                          • Flush in flush (and commitAndGet)
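
How bs fits between the buffered stream and the SerializationStream can be sketched as follows (simplified; field names follow the descriptions on this page, and serializerManager.wrapStream is where encryption and compression are applied when enabled):

// Simplified sketch of the stream layering in open.
if (!initialized) {
  initialize()                                     // builds mcs, the BufferedOutputStream
  initialized = true
}
bs = serializerManager.wrapStream(blockId, mcs)    // adds encryption/compression when enabled
objOut = serializerInstance.serializeStream(bs)    // the SerializationStream used by write
streamOpen = true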
                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/DiskStore/","title":"DiskStore","text":"

                                                                                                                                                                                                                                                                                                                                                                          DiskStore manages data blocks on disk for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/DiskStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                          DiskStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                          • DiskBlockManager
                                                                                                                                                                                                                                                                                                                                                                          • SecurityManager

DiskStore is created for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/DiskStore/#block-sizes","title":"Block Sizes
                                                                                                                                                                                                                                                                                                                                                                            blockSizes: ConcurrentHashMap[BlockId, Long]\n

DiskStore uses a ConcurrentHashMap (Java) as a registry of blocks and their data sizes (on disk).

A new entry is added in put and moveFileToBlock.

An entry is removed in remove.
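
The registry itself is plain Java concurrency; the following self-contained sketch shows the behaviour described above (block IDs are simplified to Strings):

import java.util.concurrent.ConcurrentHashMap

// Sketch of a block-size registry (BlockId simplified to String).
val blockSizes = new ConcurrentHashMap[String, Long]()

blockSizes.put("rdd_0_0", 1024L)                   // added by put and moveFileToBlock
val size = blockSizes.getOrDefault("rdd_0_0", 0L)  // looked up by getSize
blockSizes.remove("rdd_0_0")                       // removed by remove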

                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#putbytes","title":"putBytes
                                                                                                                                                                                                                                                                                                                                                                            putBytes(\n  blockId: BlockId,\n  bytes: ChunkedByteBuffer): Unit\n

putBytes puts the block by writing the given buffer out (to the block file's writable channel).
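
Conceptually, putBytes is a thin wrapper over put; a simplified sketch (ChunkedByteBuffer#writeFully writes all of the buffer's chunks to the given channel):

// Simplified sketch of putBytes.
def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = {
  put(blockId) { channel =>
    bytes.writeFully(channel)   // write the whole buffer to the block file's channel
  }
}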

putBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                            • ByteBufferBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#getbytes","title":"getBytes
                                                                                                                                                                                                                                                                                                                                                                            getBytes(\n  blockId: BlockId): BlockData\ngetBytes(\n  f: File,\n  blockSize: Long): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                            getBytes requests the DiskBlockManager for the block file and the size.

getBytes requests the SecurityManager for the IO encryption key (getIOEncryptionKey) and returns an EncryptedBlockData if the key is available or a DiskBlockData otherwise.
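
The decision boils down to a pattern match on the IO encryption key; a simplified sketch (the constructor arguments shown are indicative only):

// Simplified sketch of the encryption-aware branch in getBytes.
securityManager.getIOEncryptionKey() match {
  case Some(key) => new EncryptedBlockData(file, blockSize, conf, key)
  case None      => new DiskBlockData(minMemoryMapBytes, maxMemoryMapBytes, file, blockSize)
}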

getBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                            • TempFileBasedBlockStoreUpdater is requested to blockData
                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to getLocalValues, doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#getsize","title":"getSize
                                                                                                                                                                                                                                                                                                                                                                            getSize(\n  blockId: BlockId): Long\n

                                                                                                                                                                                                                                                                                                                                                                            getSize looks up the block in the blockSizes registry.

getSize is used when:

                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to getStatus, getCurrentBlockStatus, doPutIterator
                                                                                                                                                                                                                                                                                                                                                                            • DiskStore is requested for the block bytes
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#movefiletoblock","title":"moveFileToBlock
                                                                                                                                                                                                                                                                                                                                                                            moveFileToBlock(\n  sourceFile: File,\n  blockSize: Long,\n  targetBlockId: BlockId): Unit\n

                                                                                                                                                                                                                                                                                                                                                                            moveFileToBlock...FIXME

moveFileToBlock is used when:

                                                                                                                                                                                                                                                                                                                                                                            • TempFileBasedBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#checking-if-block-file-exists","title":"Checking if Block File Exists
                                                                                                                                                                                                                                                                                                                                                                            contains(\n  blockId: BlockId): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                            contains requests the DiskBlockManager for the block file and checks whether the file actually exists or not.
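
A minimal sketch of that check (simplified; diskManager stands for the DiskBlockManager described above):

// Simplified sketch of contains.
def contains(blockId: BlockId): Boolean = {
  val file = diskManager.getFile(blockId.name)   // resolve the block file on disk
  file.exists()                                  // the block is in the store iff the file exists
}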

contains is used when:

                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to getStatus, getCurrentBlockStatus, getLocalValues, doGetLocalBytes, dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                            • DiskStore is requested to put
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#persisting-block-to-disk","title":"Persisting Block to Disk
                                                                                                                                                                                                                                                                                                                                                                            put(\n  blockId: BlockId)(\n  writeFunc: WritableByteChannel => Unit): Unit\n

                                                                                                                                                                                                                                                                                                                                                                            put prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                            Attempting to put block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                            put requests the DiskBlockManager for the block file for the input BlockId.

                                                                                                                                                                                                                                                                                                                                                                            put opens the block file for writing (wrapped into a CountingWritableChannel to count the bytes written). put executes the given writeFunc function (with the WritableByteChannel of the block file) and saves the bytes written (to the blockSizes registry).

                                                                                                                                                                                                                                                                                                                                                                            In the end, put prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                            Block [fileName] stored as [size] file on disk in [time] ms\n

                                                                                                                                                                                                                                                                                                                                                                            In case of any exception, put deletes the block file.

                                                                                                                                                                                                                                                                                                                                                                            put throws an IllegalStateException when the block is already stored:

                                                                                                                                                                                                                                                                                                                                                                            Block [blockId] is already present in the disk store\n
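
Putting the steps above together, put can be sketched as follows (simplified, with error handling condensed; CountingWritableChannel, openForWrite and getCount follow the descriptions on this page and are not meant to be the exact source):

// Simplified sketch of put (error handling condensed).
def put(blockId: BlockId)(writeFunc: WritableByteChannel => Unit): Unit = {
  if (contains(blockId)) {
    throw new IllegalStateException(s"Block $blockId is already present in the disk store")
  }
  val file = diskManager.getFile(blockId)
  val out = new CountingWritableChannel(openForWrite(file))  // counts the bytes written
  try {
    writeFunc(out)                         // the caller writes the block data
    blockSizes.put(blockId, out.getCount)  // record the on-disk size
  } catch {
    case e: Throwable =>
      remove(blockId)                      // delete the (partial) block file
      throw e
  } finally {
    out.close()
  }
}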

put is used when:

                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to doPutIterator and dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                            • DiskStore is requested to putBytes
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#removing-block","title":"Removing Block
                                                                                                                                                                                                                                                                                                                                                                            remove(\n  blockId: BlockId): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                            remove...FIXME

remove is used when:

                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                                            • DiskStore is requested to put (and an IOException is thrown)
                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/DiskStore/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.storage.DiskStore logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.storage.DiskStore=ALL\n

                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/ExternalBlockStoreClient/","title":"ExternalBlockStoreClient","text":"

                                                                                                                                                                                                                                                                                                                                                                            ExternalBlockStoreClient is a BlockStoreClient that the driver and executors use when spark.shuffle.service.enabled configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/ExternalBlockStoreClient/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                            ExternalBlockStoreClient takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                            • TransportConf
                                                                                                                                                                                                                                                                                                                                                                            • SecretKeyHolder
                                                                                                                                                                                                                                                                                                                                                                            • authEnabled flag
                                                                                                                                                                                                                                                                                                                                                                            • registrationTimeoutMs

ExternalBlockStoreClient is created when:

                                                                                                                                                                                                                                                                                                                                                                              • SparkEnv utility is requested to create a SparkEnv (for the driver and executors) with spark.shuffle.service.enabled configuration property enabled
                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/FallbackStorage/","title":"FallbackStorage","text":"

                                                                                                                                                                                                                                                                                                                                                                              FallbackStorage is...FIXME

                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/MemoryStore/","title":"MemoryStore","text":"

                                                                                                                                                                                                                                                                                                                                                                              MemoryStore manages blocks of data in memory for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/MemoryStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                              MemoryStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                                                                                                                              • BlockInfoManager
                                                                                                                                                                                                                                                                                                                                                                              • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                              • MemoryManager
                                                                                                                                                                                                                                                                                                                                                                              • BlockEvictionHandler

MemoryStore is created when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is created

                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/MemoryStore/#blocks","title":"Blocks
                                                                                                                                                                                                                                                                                                                                                                                entries: LinkedHashMap[BlockId, MemoryEntry[_]]\n

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore creates a LinkedHashMap (Java) of blocks (as MemoryEntries per BlockId) when created.

                                                                                                                                                                                                                                                                                                                                                                                entries uses access-order ordering mode where the order of iteration is the order in which the entries were last accessed (from least-recently accessed to most-recently). That gives LRU cache behaviour when MemoryStore is requested to evict blocks.
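
The LRU behaviour comes directly from LinkedHashMap's accessOrder constructor flag; a self-contained sketch (block IDs simplified to Strings and entries to their sizes):

import java.util.LinkedHashMap

// accessOrder = true: iteration order is least-recently to most-recently accessed.
val entries = new LinkedHashMap[String, Long](32, 0.75f, true)
entries.put("rdd_0_0", 100L)
entries.put("rdd_0_1", 200L)
entries.get("rdd_0_0")                         // touching an entry makes it most-recently used
// Eviction candidates come from the front of the iteration order:
val lru = entries.entrySet().iterator().next()
println(lru.getKey)                            // rdd_0_1 (least-recently accessed)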

                                                                                                                                                                                                                                                                                                                                                                                MemoryEntries are added in putBytes and putIterator.

                                                                                                                                                                                                                                                                                                                                                                                MemoryEntries are removed in remove, clear, and while evicting blocks to free up memory.

                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#deserializedmemoryentry","title":"DeserializedMemoryEntry

                                                                                                                                                                                                                                                                                                                                                                                DeserializedMemoryEntry is a MemoryEntry for block values with the following:

                                                                                                                                                                                                                                                                                                                                                                                • Array[T] (for the values)
                                                                                                                                                                                                                                                                                                                                                                                • size
                                                                                                                                                                                                                                                                                                                                                                                • ON_HEAP memory mode
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#serializedmemoryentry","title":"SerializedMemoryEntry

                                                                                                                                                                                                                                                                                                                                                                                SerializedMemoryEntry is a MemoryEntry for block bytes with the following:

                                                                                                                                                                                                                                                                                                                                                                                • ChunkedByteBuffer (for the serialized values)
                                                                                                                                                                                                                                                                                                                                                                                • size
                                                                                                                                                                                                                                                                                                                                                                                • MemoryMode
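
The two entry shapes above can be pictured with the following simplified sketch (DeserializedEntry and SerializedEntry are stand-in names; Spark's own classes are package-private, also carry a ClassTag, and use an internal ChunkedByteBuffer rather than a plain ByteBuffer):

```scala
import java.nio.ByteBuffer
import org.apache.spark.memory.MemoryMode

// Simplified stand-ins for the Spark-internal MemoryEntry hierarchy.
sealed trait Entry[T] { def size: Long; def memoryMode: MemoryMode }

// Deserialized values are kept as an object array on the JVM heap.
case class DeserializedEntry[T](values: Array[T], size: Long) extends Entry[T] {
  val memoryMode: MemoryMode = MemoryMode.ON_HEAP
}

// Serialized bytes can live on- or off-heap.
case class SerializedEntry[T](buffer: ByteBuffer, size: Long, memoryMode: MemoryMode) extends Entry[T]
```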
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#sparkstorageunrollmemorythreshold","title":"spark.storage.unrollMemoryThreshold

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore uses spark.storage.unrollMemoryThreshold configuration property when requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                • putIterator
                                                                                                                                                                                                                                                                                                                                                                                • putIteratorAsBytes
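
The property itself is a plain byte-size entry in SparkConf. A hypothetical tuning example (the 4 MB value is arbitrary; Spark's default is 1 MB):

```scala
import org.apache.spark.SparkConf

// Hypothetical example: reserve 4 MB up front before unrolling each block.
val conf = new SparkConf()
  .set("spark.storage.unrollMemoryThreshold", (4 * 1024 * 1024).toString)
```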
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#evicting-blocks","title":"Evicting Blocks
                                                                                                                                                                                                                                                                                                                                                                                evictBlocksToFreeSpace(\n  blockId: Option[BlockId],\n  space: Long,\n  memoryMode: MemoryMode): Long\n

                                                                                                                                                                                                                                                                                                                                                                                evictBlocksToFreeSpace finds blocks to evict in the entries registry (based on least-recently accessed order and until the required space to free up is met or there are no more blocks).

                                                                                                                                                                                                                                                                                                                                                                                Once done, evictBlocksToFreeSpace returns the memory freed up.

When there are enough blocks to drop to free up memory, evictBlocksToFreeSpace prints out the following INFO message to the logs:

[n] blocks selected for dropping ([freedMemory] bytes)\n

                                                                                                                                                                                                                                                                                                                                                                                evictBlocksToFreeSpace drops the blocks one by one.

                                                                                                                                                                                                                                                                                                                                                                                evictBlocksToFreeSpace prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                After dropping [n] blocks, free memory is [memory]\n

When there are not enough blocks to drop to make room for the given block (if any), evictBlocksToFreeSpace prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                Will not store [blockId]\n
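
The selection phase can be pictured with this simplified sketch (a stand-in Block type and plain Scala; the real method also filters out blocks that cannot be evicted, e.g. blocks of the RDD that is being stored or blocks with a different memory mode, and takes write locks through the BlockInfoManager):

```scala
import scala.collection.mutable.ArrayBuffer

// Stand-in: a block is just a name and a size here.
final case class Block(id: String, size: Long)

/** Walk blocks in least-recently-used order and pick victims until `space` bytes are covered. */
def selectBlocksToEvict(lruBlocks: Iterator[Block], space: Long): (Seq[Block], Long) = {
  val selected = ArrayBuffer.empty[Block]
  var freed = 0L
  while (freed < space && lruBlocks.hasNext) {
    val block = lruBlocks.next()
    selected += block
    freed += block.size
  }
  if (freed >= space) (selected.toSeq, freed)   // enough victims found; the caller drops them one by one
  else (Seq.empty, 0L)                          // not enough space can be freed; nothing is evicted
}
```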

                                                                                                                                                                                                                                                                                                                                                                                evictBlocksToFreeSpace\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • StorageMemoryPool is requested to acquire memory and free up space to shrink pool
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#dropping-block","title":"Dropping Block
                                                                                                                                                                                                                                                                                                                                                                                dropBlock[T](\n  blockId: BlockId,\n  entry: MemoryEntry[T]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                dropBlock requests the BlockEvictionHandler to drop the block from memory.

                                                                                                                                                                                                                                                                                                                                                                                If the block is no longer available in any other store, dropBlock requests the BlockInfoManager to remove the block (info).
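
A pseudocode-level sketch of that decision, with hypothetical callback parameters standing in for the BlockEvictionHandler and BlockInfoManager calls:

```scala
// Stand-in callbacks; the real method uses Spark-internal APIs.
def dropBlockSketch(
    blockId: String,
    dropFromMemory: String => Boolean,   // true if the block is still available in another store (e.g. on disk)
    removeBlockInfo: String => Unit): Unit = {
  val stillAvailable = dropFromMemory(blockId)
  if (!stillAvailable) {
    // the block is gone from every store, so its metadata goes too
    removeBlockInfo(blockId)
  }
}
```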

                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#blockinfomanager","title":"BlockInfoManager

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore is given a BlockInfoManager when created.

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore uses the BlockInfoManager when requested to evict blocks.

                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#accessing-memorystore","title":"Accessing MemoryStore

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore is available to other Spark services using BlockManager.memoryStore.

                                                                                                                                                                                                                                                                                                                                                                                import org.apache.spark.SparkEnv\nSparkEnv.get.blockManager.memoryStore\n
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#serialized-block-bytes","title":"Serialized Block Bytes
                                                                                                                                                                                                                                                                                                                                                                                getBytes(\n  blockId: BlockId): Option[ChunkedByteBuffer]\n

                                                                                                                                                                                                                                                                                                                                                                                getBytes returns the bytes of the SerializedMemoryEntry of a block (if found in the entries registry).
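
Conceptually this is a lookup in the entries registry followed by a match on the entry type. A simplified sketch with stand-in types (String for BlockId, ByteBuffer for the internal ChunkedByteBuffer, an immutable Map for the registry):

```scala
import java.nio.ByteBuffer

// Stand-ins for the internal entry types.
sealed trait Entry
case class SerializedEntry(buffer: ByteBuffer) extends Entry
case class DeserializedEntry(values: Array[Any]) extends Entry

// Return the serialized bytes only when the block is stored in serialized form.
def getBytesSketch(entries: Map[String, Entry], blockId: String): Option[ByteBuffer] =
  entries.get(blockId) match {
    case Some(SerializedEntry(buffer)) => Some(buffer)
    case Some(_: DeserializedEntry)    => sys.error(s"$blockId should not be stored as values")
    case None                          => None
  }
```

getValues (below) mirrors this, matching on the deserialized entry and returning an iterator over its values instead.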

                                                                                                                                                                                                                                                                                                                                                                                getBytes is used (for blocks with a serialized and in-memory storage level) when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested for the serialized bytes of a block (from a local block manager), getLocalValues, maybeCacheDiskBytesInMemory
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#fetching-deserialized-block-values","title":"Fetching Deserialized Block Values
                                                                                                                                                                                                                                                                                                                                                                                getValues(\n  blockId: BlockId): Option[Iterator[_]]\n

                                                                                                                                                                                                                                                                                                                                                                                getValues returns the values of the DeserializedMemoryEntry of the given block (if available in the entries registry).

                                                                                                                                                                                                                                                                                                                                                                                getValues is used (for blocks with a deserialized and in-memory storage level) when:

• BlockManager is requested to getLocalValues (of a block from a local block manager)
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#putiteratorasbytes","title":"putIteratorAsBytes
                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsBytes[T](\n  blockId: BlockId,\n  values: Iterator[T],\n  classTag: ClassTag[T],\n  memoryMode: MemoryMode): Either[PartiallySerializedBlock[T], Long]\n

                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsBytes requires that the block is not already stored.

putIteratorAsBytes calls putIterator (with the given BlockId, the values, the MemoryMode and a new SerializedValuesHolder).

                                                                                                                                                                                                                                                                                                                                                                                If successful, putIteratorAsBytes returns the estimated size of the block. Otherwise, a PartiallySerializedBlock.

                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsBytes prints out the following WARN message to the logs when the initial memory threshold is too large:

                                                                                                                                                                                                                                                                                                                                                                                Initial memory threshold of [initialMemoryThreshold] is too large to be set as chunk size.\nChunk size has been capped to \"MAX_ROUNDED_ARRAY_LENGTH\"\n

                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsBytes\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to doPutIterator (for a block with StorageLevel with useMemory and serialized)
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#putiteratorasvalues","title":"putIteratorAsValues
                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsValues[T](\n  blockId: BlockId,\n  values: Iterator[T],\n  memoryMode: MemoryMode,\n  classTag: ClassTag[T]): Either[PartiallyUnrolledIterator[T], Long]\n

putIteratorAsValues calls putIterator (with the given BlockId, the values, the MemoryMode and a new DeserializedValuesHolder).

                                                                                                                                                                                                                                                                                                                                                                                If successful, putIteratorAsValues returns the estimated size of the block. Otherwise, a PartiallyUnrolledIterator.

                                                                                                                                                                                                                                                                                                                                                                                putIteratorAsValues\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockStoreUpdater is requested to saveDeserializedValuesToMemoryStore
                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to doPutIterator and maybeCacheDiskValuesInMemory
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#putiterator","title":"putIterator
                                                                                                                                                                                                                                                                                                                                                                                putIterator[T](\n  blockId: BlockId,\n  values: Iterator[T],\n  classTag: ClassTag[T],\n  memoryMode: MemoryMode,\n  valuesHolder: ValuesHolder[T]): Either[Long, Long]\n

                                                                                                                                                                                                                                                                                                                                                                                putIterator returns the (estimated) size of the block (as Right) or the unrollMemoryUsedByThisBlock (as Left).

                                                                                                                                                                                                                                                                                                                                                                                putIterator requires that the block is not already in the MemoryStore.

putIterator calls reserveUnrollMemoryForThisTask (with spark.storage.unrollMemoryThreshold as the initial memory threshold).

                                                                                                                                                                                                                                                                                                                                                                                If putIterator did not manage to reserve the memory for unrolling (computing block in memory), it prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                Failed to reserve initial memory threshold of [initialMemoryThreshold]\nfor computing block [blockId] in memory.\n

putIterator requests the ValuesHolder to storeValue for every value in the given values iterator. putIterator checks memory usage regularly (whether it may have exceeded the threshold) and calls reserveUnrollMemoryForThisTask for more memory when needed.

                                                                                                                                                                                                                                                                                                                                                                                putIterator requests the ValuesHolder for a MemoryEntryBuilder (getBuilder) that in turn is requested to build a MemoryEntry.

putIterator then calls releaseUnrollMemoryForThisTask.

                                                                                                                                                                                                                                                                                                                                                                                putIterator requests the MemoryManager to acquireStorageMemory and stores the block (in the entries registry).

                                                                                                                                                                                                                                                                                                                                                                                In the end, putIterator prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                Block [blockId] stored as values in memory (estimated size [size], free [free])\n

When there is not enough memory to store the block, putIterator calls logUnrollFailureMessage and returns the unrollMemoryUsedByThisBlock (as Left).
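
The overall control flow can be summarized with the following heavily simplified sketch (stand-in callback parameters replace the ValuesHolder, MemoryManager and per-task accounting of the real method; the check-every-16-values cadence is only illustrative):

```scala
// A heavily simplified sketch of the unroll-then-store flow.
def putIteratorSketch[T](
    values: Iterator[T],
    initialThreshold: Long,                 // spark.storage.unrollMemoryThreshold
    reserveUnrollMemory: Long => Boolean,   // asks for more unroll memory
    estimateSizeSoFar: () => Long,          // asks the values holder for its current estimate
    storeValue: T => Unit,                  // appends a value to the values holder
    finishStore: Long => Boolean            // trades unroll memory for storage memory and registers the entry
): Either[Long, Long] = {
  var keepUnrolling = reserveUnrollMemory(initialThreshold)
  var reserved = if (keepUnrolling) initialThreshold else 0L
  var count = 0L

  while (values.hasNext && keepUnrolling) {
    storeValue(values.next())
    count += 1
    if (count % 16 == 0) {                  // check the estimate periodically, not per element
      val needed = estimateSizeSoFar()
      if (needed > reserved) {
        keepUnrolling = reserveUnrollMemory(needed - reserved)
        if (keepUnrolling) reserved = needed
      }
    }
  }

  if (keepUnrolling) {
    val finalSize = estimateSizeSoFar()
    if (finishStore(finalSize)) Right(finalSize)   // block stored; report its estimated size
    else Left(reserved)                            // could not acquire storage memory
  } else Left(reserved)                            // ran out of unroll memory mid-way
}
```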

                                                                                                                                                                                                                                                                                                                                                                                putIterator\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested to putIteratorAsValues and putIteratorAsBytes
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#logunrollfailuremessage","title":"logUnrollFailureMessage
                                                                                                                                                                                                                                                                                                                                                                                logUnrollFailureMessage(\n  blockId: BlockId,\n  finalVectorSize: Long): Unit\n

logUnrollFailureMessage prints out the following WARN message to the logs and then calls logMemoryUsage.

                                                                                                                                                                                                                                                                                                                                                                                Not enough space to cache [blockId] in memory! (computed [size] so far)\n
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#logmemoryusage","title":"logMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                logMemoryUsage(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                logMemoryUsage prints out the following INFO message to the logs (with the blocksMemoryUsed, currentUnrollMemory, numTasksUnrolling, memoryUsed, and maxMemory):

                                                                                                                                                                                                                                                                                                                                                                                Memory use = [blocksMemoryUsed] (blocks) + [currentUnrollMemory]\n(scratch space shared across [numTasksUnrolling] tasks(s)) = [memoryUsed].\nStorage limit = [maxMemory].\n
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#storing-block","title":"Storing Block
                                                                                                                                                                                                                                                                                                                                                                                putBytes[T: ClassTag](\n  blockId: BlockId,\n  size: Long,\n  memoryMode: MemoryMode,\n  _bytes: () => ChunkedByteBuffer): Boolean\n

putBytes returns true only when there was enough memory to store the block (BlockId) in the entries registry.

                                                                                                                                                                                                                                                                                                                                                                                putBytes asserts that the block is not stored yet.

                                                                                                                                                                                                                                                                                                                                                                                putBytes requests the MemoryManager for memory (to store the block) and, when successful, adds the block to the entries registry (as a SerializedMemoryEntry with the _bytes and the MemoryMode).

                                                                                                                                                                                                                                                                                                                                                                                In the end, putBytes prints out the following INFO message to the logs:

Block [blockId] stored as bytes in memory (estimated size [size], free [free])\n
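
In essence, putBytes is "reserve storage memory, then register the entry". A simplified sketch with stand-in types and callbacks:

```scala
// Stand-ins: String for BlockId, a mutable Map for the entries registry,
// a callback for MemoryManager.acquireStorageMemory.
def putBytesSketch[B](
    blockId: String,
    size: Long,
    acquireStorageMemory: Long => Boolean,
    bytes: () => B,
    entries: scala.collection.mutable.Map[String, B]): Boolean = {
  require(!entries.contains(blockId), s"Block $blockId is already present in the MemoryStore")
  if (acquireStorageMemory(size)) {
    // the bytes are only materialized once the memory reservation succeeded
    entries.put(blockId, bytes())
    true
  } else {
    false
  }
}
```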

                                                                                                                                                                                                                                                                                                                                                                                putBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockStoreUpdater is requested to save serialized values (to MemoryStore)
                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to maybeCacheDiskBytesInMemory
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#memory-used-for-caching-blocks","title":"Memory Used for Caching Blocks
                                                                                                                                                                                                                                                                                                                                                                                blocksMemoryUsed: Long\n

blocksMemoryUsed is the total memory used minus the memory used for unrolling.
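
Expressed as a formula (a sketch; both operands are MemoryStore's own metrics):

```scala
// blocksMemoryUsed = total storage memory in use - memory reserved for unrolling
def blocksMemoryUsedSketch(memoryUsed: Long, currentUnrollMemory: Long): Long =
  memoryUsed - currentUnrollMemory
```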

                                                                                                                                                                                                                                                                                                                                                                                blocksMemoryUsed is used for logging purposes (when MemoryStore is requested to putBytes, putIterator, remove, evictBlocksToFreeSpace and logMemoryUsage).

                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#total-storage-memory-in-use","title":"Total Storage Memory in Use
                                                                                                                                                                                                                                                                                                                                                                                memoryUsed: Long\n

                                                                                                                                                                                                                                                                                                                                                                                memoryUsed requests the MemoryManager for the total storage memory.

                                                                                                                                                                                                                                                                                                                                                                                memoryUsed is used when:

                                                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested for blocksMemoryUsed and to logMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#maximum-storage-memory","title":"Maximum Storage Memory
                                                                                                                                                                                                                                                                                                                                                                                maxMemory: Long\n

                                                                                                                                                                                                                                                                                                                                                                                maxMemory is the total amount of memory available for storage (in bytes) and is the sum of the maxOnHeapStorageMemory and maxOffHeapStorageMemory of the MemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                Tip

                                                                                                                                                                                                                                                                                                                                                                                Enable INFO logging for MemoryStore to print out the maxMemory to the logs when created:

                                                                                                                                                                                                                                                                                                                                                                                MemoryStore started with capacity [maxMemory] MB\n

                                                                                                                                                                                                                                                                                                                                                                                maxMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested for the blocksMemoryUsed and to logMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#dropping-block-from-memory","title":"Dropping Block from Memory
                                                                                                                                                                                                                                                                                                                                                                                remove(\n  blockId: BlockId): Boolean\n

remove returns true when the given block (BlockId) was found and removed from the entries registry, with its memory released back to the MemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                remove removes (drops) the block (BlockId) from the entries registry.

                                                                                                                                                                                                                                                                                                                                                                                If found and removed, remove requests the MemoryManager to releaseStorageMemory and prints out the following DEBUG message to the logs (with the maxMemory and blocksMemoryUsed):

                                                                                                                                                                                                                                                                                                                                                                                Block [blockId] of size [size] dropped from memory (free [memory])\n
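
A simplified sketch of that flow, with stand-in types and a callback in place of the MemoryManager:

```scala
// Drop the entry, then give the reserved storage memory back.
def removeSketch[E](
    blockId: String,
    entries: scala.collection.mutable.Map[String, E],
    sizeOf: E => Long,
    releaseStorageMemory: Long => Unit): Boolean =
  entries.remove(blockId) match {
    case Some(entry) =>
      releaseStorageMemory(sizeOf(entry))   // free the memory that was reserved for the block
      true
    case None =>
      false                                 // the block was not stored in memory
  }
```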

                                                                                                                                                                                                                                                                                                                                                                                remove\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to dropFromMemory and removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#releasing-unroll-memory-for-task","title":"Releasing Unroll Memory for Task
                                                                                                                                                                                                                                                                                                                                                                                releaseUnrollMemoryForThisTask(\n  memoryMode: MemoryMode,\n  memory: Long = Long.MaxValue): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                releaseUnrollMemoryForThisTask finds the task attempt ID of the current task.

                                                                                                                                                                                                                                                                                                                                                                                releaseUnrollMemoryForThisTask uses the onHeapUnrollMemoryMap or offHeapUnrollMemoryMap based on the given MemoryMode.

                                                                                                                                                                                                                                                                                                                                                                                (Only when the unroll memory map contains the task attempt ID) releaseUnrollMemoryForThisTask descreases the memory registered in the unroll memory map by the given memory amount and requests the MemoryManager to releaseUnrollMemory. In the end, releaseUnrollMemoryForThisTask removes the task attempt ID (entry) from the unroll memory map if the memory used is 0.

                                                                                                                                                                                                                                                                                                                                                                                releaseUnrollMemoryForThisTask\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                • Task is requested to run (and is about to finish)
                                                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested to putIterator
                                                                                                                                                                                                                                                                                                                                                                                • PartiallyUnrolledIterator is requested to releaseUnrollMemory
                                                                                                                                                                                                                                                                                                                                                                                • PartiallySerializedBlock is requested to discard and finishWritingToStream
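
The per-task bookkeeping described above can be sketched as follows. UnrollMemoryTracker and its single unrollMemoryMap are hypothetical simplifications (no on-heap/off-heap split) rather than the actual MemoryStore fields:

```scala
import scala.collection.mutable

// Hypothetical, simplified sketch of the per-task unroll-memory bookkeeping
class UnrollMemoryTracker {
  private val unrollMemoryMap = mutable.HashMap.empty[Long, Long] // taskAttemptId -> bytes

  def releaseUnrollMemoryForTask(taskAttemptId: Long, memory: Long = Long.MaxValue): Unit =
    synchronized {
      if (unrollMemoryMap.contains(taskAttemptId)) {
        // Never release more than what is actually tracked for the task
        val memoryToRelease = math.min(memory, unrollMemoryMap(taskAttemptId))
        if (memoryToRelease > 0) {
          unrollMemoryMap(taskAttemptId) -= memoryToRelease
          // here the real MemoryStore requests the MemoryManager to releaseUnrollMemory
        }
        if (unrollMemoryMap(taskAttemptId) == 0) {
          unrollMemoryMap.remove(taskAttemptId)
        }
      }
    }
}
```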
                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/MemoryStore/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.storage.memory.MemoryStore logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.storage.memory.MemoryStore=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockRpcServer/","title":"NettyBlockRpcServer","text":"

                                                                                                                                                                                                                                                                                                                                                                                NettyBlockRpcServer is a RpcHandler to handle messages for NettyBlockTransferService.

                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/NettyBlockRpcServer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                NettyBlockRpcServer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                • Application ID
                                                                                                                                                                                                                                                                                                                                                                                • Serializer
                                                                                                                                                                                                                                                                                                                                                                                • BlockDataManager

                                                                                                                                                                                                                                                                                                                                                                                  NettyBlockRpcServer is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                  • NettyBlockTransferService is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/NettyBlockRpcServer/#oneforonestreammanager","title":"OneForOneStreamManager

                                                                                                                                                                                                                                                                                                                                                                                  NettyBlockRpcServer uses a OneForOneStreamManager.

                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockRpcServer/#receiving-rpc-messages","title":"Receiving RPC Messages
                                                                                                                                                                                                                                                                                                                                                                                  receive(\n  client: TransportClient,\n  rpcMessage: ByteBuffer,\n  responseContext: RpcResponseCallback): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                  receive deserializes the incoming RPC message (from ByteBuffer to BlockTransferMessage) and prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                  Received request: [message]\n

                                                                                                                                                                                                                                                                                                                                                                                  receive handles the message.

                                                                                                                                                                                                                                                                                                                                                                                  receive\u00a0is part of the RpcHandler abstraction.
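
To picture the dispatch, here is a simplified, hypothetical sketch of an RpcHandler-style receive. The message types and SimpleBlockRpcServer below are stand-ins, not Spark's BlockTransferMessage hierarchy:

```scala
import java.nio.ByteBuffer

// Stand-in message types (Spark uses the BlockTransferMessage hierarchy instead)
sealed trait TransferMessage
case class OpenBlocksMsg(appId: String, execId: String, blockIds: Seq[String]) extends TransferMessage
case class UploadBlockMsg(appId: String, execId: String, blockId: String, data: Array[Byte]) extends TransferMessage

class SimpleBlockRpcServer {
  def receive(rpcMessage: TransferMessage, respond: ByteBuffer => Unit): Unit = {
    println(s"Received request: $rpcMessage") // the TRACE message above
    rpcMessage match {
      case OpenBlocksMsg(_, _, blockIds) =>
        // register a stream of block buffers and reply with a stream handle
        respond(ByteBuffer.wrap(s"stream:${blockIds.size}".getBytes("UTF-8")))
      case UploadBlockMsg(_, _, blockId, data) =>
        // hand the block data over to a BlockDataManager equivalent and acknowledge
        respond(ByteBuffer.allocate(0))
    }
  }
}
```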

                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockRpcServer/#fetchshuffleblocks","title":"FetchShuffleBlocks

                                                                                                                                                                                                                                                                                                                                                                                  FetchShuffleBlocks carries the following:

                                                                                                                                                                                                                                                                                                                                                                                  • Application ID
                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                  • Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                  • Map IDs (long[])
                                                                                                                                                                                                                                                                                                                                                                                  • Reduce IDs (long[][])
                                                                                                                                                                                                                                                                                                                                                                                  • batchFetchEnabled flag

                                                                                                                                                                                                                                                                                                                                                                                  When received, receive...FIXME

                                                                                                                                                                                                                                                                                                                                                                                  receive prints out the following TRACE message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                  Registered streamId [streamId] with [numBlockIds] buffers\n

                                                                                                                                                                                                                                                                                                                                                                                  In the end, receive responds with a StreamHandle (with the streamId and the number of blocks). The response is serialized to a ByteBuffer.

                                                                                                                                                                                                                                                                                                                                                                                  FetchShuffleBlocks is posted when:

                                                                                                                                                                                                                                                                                                                                                                                  • OneForOneBlockFetcher is requested to createFetchShuffleBlocksMsgAndBuildBlockIds
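
As an illustration, the sketch below shows how the map and reduce IDs of a FetchShuffleBlocks-like message expand into individual shuffle block names, assuming Spark's shuffle_[shuffleId]_[mapId]_[reduceId] naming convention. FetchShuffleBlocksMsg and blockIdsOf are hypothetical helpers, not Spark code:

```scala
// Hypothetical message shape mirroring the fields listed above
case class FetchShuffleBlocksMsg(
    appId: String,
    execId: String,
    shuffleId: Int,
    mapIds: Array[Long],
    reduceIds: Array[Array[Long]],
    batchFetchEnabled: Boolean)

// Expands (mapId, reduceId) pairs into shuffle block names
def blockIdsOf(msg: FetchShuffleBlocksMsg): Seq[String] =
  msg.mapIds.toIndexedSeq.zip(msg.reduceIds.toIndexedSeq).flatMap { case (mapId, reduces) =>
    reduces.map(reduceId => s"shuffle_${msg.shuffleId}_${mapId}_${reduceId}")
  }

// blockIdsOf(FetchShuffleBlocksMsg("app", "0", 3, Array(1L, 2L), Array(Array(0L), Array(0L, 1L)), false))
// => Vector("shuffle_3_1_0", "shuffle_3_2_0", "shuffle_3_2_1")
```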
                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockRpcServer/#getlocaldirsforexecutors","title":"GetLocalDirsForExecutors","text":""},{"location":"storage/NettyBlockRpcServer/#openblocks","title":"OpenBlocks

                                                                                                                                                                                                                                                                                                                                                                                  OpenBlocks carries the following:

                                                                                                                                                                                                                                                                                                                                                                                  • Application ID
                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                  • Block IDs

                                                                                                                                                                                                                                                                                                                                                                                  When received, receive...FIXME

                                                                                                                                                                                                                                                                                                                                                                                  receive prints out the following TRACE message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                  Registered streamId [streamId] with [blocksNum] buffers\n

                                                                                                                                                                                                                                                                                                                                                                                  In the end, receive responds with a StreamHandle (with the streamId and the number of blocks). The response is serialized to a ByteBuffer.

                                                                                                                                                                                                                                                                                                                                                                                  OpenBlocks is posted when:

                                                                                                                                                                                                                                                                                                                                                                                  • OneForOneBlockFetcher is requested to start
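
The stream registration that both OpenBlocks and FetchShuffleBlocks end with can be pictured with the following simplified sketch. SimpleStreamManager and StreamHandleMsg are hypothetical stand-ins for OneForOneStreamManager and StreamHandle:

```scala
import java.nio.ByteBuffer
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

case class StreamHandleMsg(streamId: Long, numBlocks: Int)

class SimpleStreamManager {
  private val nextStreamId = new AtomicLong(0L)
  private val streams = TrieMap.empty[Long, Seq[ByteBuffer]]

  /** Registers the block buffers under a fresh streamId and returns the handle for the client. */
  def registerStream(blocks: Seq[ByteBuffer]): StreamHandleMsg = {
    val streamId = nextStreamId.incrementAndGet()
    streams.put(streamId, blocks)
    println(s"Registered streamId $streamId with ${blocks.size} buffers")
    StreamHandleMsg(streamId, blocks.size)
  }
}
```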
                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockRpcServer/#uploadblock","title":"UploadBlock

                                                                                                                                                                                                                                                                                                                                                                                  UploadBlock carries the following:

                                                                                                                                                                                                                                                                                                                                                                                  • Application ID
                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                  • Block ID
                                                                                                                                                                                                                                                                                                                                                                                  • Metadata (byte[])
                                                                                                                                                                                                                                                                                                                                                                                  • Block Data (byte[])

                                                                                                                                                                                                                                                                                                                                                                                  When received, receive deserializes the metadata to get the StorageLevel and ClassTag of the block being uploaded.

                                                                                                                                                                                                                                                                                                                                                                                  receive...FIXME

                                                                                                                                                                                                                                                                                                                                                                                  UploadBlock is posted when:

                                                                                                                                                                                                                                                                                                                                                                                  • NettyBlockTransferService is requested to upload a block
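
The metadata round-trip described above can be sketched as follows, assuming spark-core is on the classpath. The tuple layout is only an illustration of the idea (StorageLevel plus ClassTag serialized with a JavaSerializer), not necessarily the exact wire format:

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.storage.StorageLevel

val serializer = new JavaSerializer(new SparkConf()).newInstance()

// sender side: pack (StorageLevel, ClassTag) into the metadata bytes
val metadata: ByteBuffer =
  serializer.serialize((StorageLevel.MEMORY_AND_DISK, implicitly[ClassTag[Array[Byte]]]))

// receiver side: unpack the metadata to learn how to store the uploaded block
val (level, classTag) =
  serializer.deserialize[(StorageLevel, ClassTag[_])](metadata)
```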
                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockRpcServer/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.network.netty.NettyBlockRpcServer logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.network.netty.NettyBlockRpcServer=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/NettyBlockTransferService/","title":"NettyBlockTransferService","text":"

                                                                                                                                                                                                                                                                                                                                                                                  NettyBlockTransferService is a BlockTransferService that uses Netty for uploading and fetching blocks of data.

                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/NettyBlockTransferService/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                  NettyBlockTransferService takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                  • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                  • Bind Address
                                                                                                                                                                                                                                                                                                                                                                                  • Host Name
                                                                                                                                                                                                                                                                                                                                                                                  • Port
                                                                                                                                                                                                                                                                                                                                                                                  • Number of CPU Cores
                                                                                                                                                                                                                                                                                                                                                                                  • Driver RpcEndpointRef

                                                                                                                                                                                                                                                                                                                                                                                    NettyBlockTransferService is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                    • SparkEnv utility is used to create a SparkEnv (for the driver and executors and creates a BlockManager)
                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/NettyBlockTransferService/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                    init(\n  blockDataManager: BlockDataManager): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                    init\u00a0is part of the BlockTransferService abstraction.

                                                                                                                                                                                                                                                                                                                                                                                    init creates a NettyBlockRpcServer (with the application ID, a JavaSerializer and the given BlockDataManager).

                                                                                                                                                                                                                                                                                                                                                                                    init creates a TransportContext (with the NettyBlockRpcServer just created) and requests it for a TransportClientFactory.

                                                                                                                                                                                                                                                                                                                                                                                    init createServer.

                                                                                                                                                                                                                                                                                                                                                                                    In the end, init prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                    Server created on [hostName]:[port]\n
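
To make the wiring order concrete, here is a simplified, hypothetical sketch of the steps above. The Stub* types are stand-ins, not the classes from Spark's network modules:

```scala
// Stand-in types to illustrate the initialization order only
trait RpcHandler
class StubBlockRpcServer extends RpcHandler // NettyBlockRpcServer equivalent

class StubTransportContext(rpcHandler: RpcHandler) {
  def createClientFactory(): String = "client-factory"
  def createServer(host: String, port: Int): (String, Int) = (host, port)
}

def init(hostName: String, port: Int): Unit = {
  val rpcHandler = new StubBlockRpcServer
  val context = new StubTransportContext(rpcHandler)   // TransportContext equivalent
  val clientFactory = context.createClientFactory()    // used later by fetchBlocks/uploadBlock
  val (serverHost, serverPort) = context.createServer(hostName, port)
  println(s"Server created on $serverHost:$serverPort") // the INFO message above
}
```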
                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/NettyBlockTransferService/#fetching-blocks","title":"Fetching Blocks
                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks(\n  host: String,\n  port: Int,\n  execId: String,\n  blockIds: Array[String],\n  listener: BlockFetchingListener,\n  tempFileManager: DownloadFileManager): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                    Fetch blocks from [host]:[port] (executor id [execId])\n

                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks requests the TransportConf for the maxIORetries.

                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks creates a BlockTransferStarter.

                                                                                                                                                                                                                                                                                                                                                                                    With the maxIORetries above zero, fetchBlocks creates a RetryingBlockFetcher (with the BlockFetchStarter, the blockIds and the BlockFetchingListener) and starts it.

                                                                                                                                                                                                                                                                                                                                                                                    Otherwise, fetchBlocks requests the BlockFetchStarter to createAndStart (with the blockIds and the BlockFetchingListener).

                                                                                                                                                                                                                                                                                                                                                                                    In case of any Exception, fetchBlocks prints out the following ERROR message to the logs and the given BlockFetchingListener gets notified.

                                                                                                                                                                                                                                                                                                                                                                                    Exception while beginning fetchBlocks\n

                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks\u00a0is part of the BlockStoreClient abstraction.
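
The retry decision described above can be sketched as follows. BlockTransferStarter and SimpleRetryingFetcher are simplified stand-ins for the starter and RetryingBlockFetcher, and the retry loop is a bare-bones illustration rather than the actual retry policy:

```scala
trait BlockTransferStarter {
  def createAndStart(blockIds: Array[String]): Unit
}

class SimpleRetryingFetcher(starter: BlockTransferStarter, blockIds: Array[String], maxRetries: Int) {
  def start(): Unit = {
    var attempt = 0
    var done = false
    while (!done && attempt <= maxRetries) {
      try { starter.createAndStart(blockIds); done = true }
      catch { case _: java.io.IOException => attempt += 1 } // retry transient I/O failures
    }
  }
}

def fetchBlocks(blockIds: Array[String], maxIORetries: Int, starter: BlockTransferStarter): Unit =
  if (maxIORetries > 0) new SimpleRetryingFetcher(starter, blockIds, maxIORetries).start()
  else starter.createAndStart(blockIds) // no retries configured: fire once
```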

                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/NettyBlockTransferService/#blocktransferstarter","title":"BlockTransferStarter

                                                                                                                                                                                                                                                                                                                                                                                    fetchBlocks creates a BlockTransferStarter. When requested to createAndStart, the BlockTransferStarter requests the TransportClientFactory to create a TransportClient.

                                                                                                                                                                                                                                                                                                                                                                                    createAndStart creates an OneForOneBlockFetcher and requests it to start.

                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/NettyBlockTransferService/#ioexception","title":"IOException

                                                                                                                                                                                                                                                                                                                                                                                    In case of an IOException, createAndStart requests the driver RpcEndpointRef to send an IsExecutorAlive message synchronously (with the given execId).

                                                                                                                                                                                                                                                                                                                                                                                    If the driver RpcEndpointRef replied false, createAndStart throws an ExecutorDeadException:

                                                                                                                                                                                                                                                                                                                                                                                    The relative remote executor(Id: [execId]),\nwhich maintains the block data to fetch is dead.\n

                                                                                                                                                                                                                                                                                                                                                                                    Otherwise, createAndStart (re)throws the IOException.
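
A hedged sketch of this liveness check follows. DriverRef, IsExecutorAlive and ExecutorDeadException below are simplified stand-ins for the RpcEndpointRef-based protocol described above:

```scala
import java.io.IOException

case class IsExecutorAlive(execId: String)
class ExecutorDeadException(msg: String) extends IOException(msg)

trait DriverRef {
  def askSync(msg: IsExecutorAlive): Boolean // synchronous ask, like RpcEndpointRef.askSync
}

def handleFetchFailure(e: IOException, driver: DriverRef, execId: String): Nothing = {
  if (!driver.askSync(IsExecutorAlive(execId))) {
    throw new ExecutorDeadException(
      s"The relative remote executor(Id: $execId), which maintains the block data to fetch is dead.")
  } else {
    throw e // executor is alive: rethrow the original IOException so retries can kick in
  }
}
```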

                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/NettyBlockTransferService/#uploading-block","title":"Uploading Block
                                                                                                                                                                                                                                                                                                                                                                                    uploadBlock(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Future[Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                    uploadBlock\u00a0is part of the BlockTransferService abstraction.

                                                                                                                                                                                                                                                                                                                                                                                    uploadBlock creates a TransportClient (with the given hostname and port).

                                                                                                                                                                                                                                                                                                                                                                                    uploadBlock serializes the given StorageLevel and ClassTag (using a JavaSerializer).

                                                                                                                                                                                                                                                                                                                                                                                    uploadBlock uses a stream to transfer shuffle blocks when one of the following holds:

                                                                                                                                                                                                                                                                                                                                                                                    1. The size of the block data (ManagedBuffer) is above spark.network.maxRemoteBlockSizeFetchToMem configuration property
                                                                                                                                                                                                                                                                                                                                                                                    2. The given BlockId is a shuffle block

                                                                                                                                                                                                                                                                                                                                                                                    For stream transfer uploadBlock requests the TransportClient to uploadStream. Otherwise, uploadBlock requests the TransportClient to sendRpc a UploadBlock message.

                                                                                                                                                                                                                                                                                                                                                                                    Note

                                                                                                                                                                                                                                                                                                                                                                                    UploadBlock message is processed by NettyBlockRpcServer.

                                                                                                                                                                                                                                                                                                                                                                                    With the upload successful, uploadBlock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                    Successfully uploaded block [blockId] [as stream]\n

                                                                                                                                                                                                                                                                                                                                                                                    With the upload failed, uploadBlock prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                    Error while uploading block [blockId] [as stream]\n
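
The stream-vs-RPC decision can be sketched as below. The parameter names are simplified stand-ins (maxRemoteBlockSizeFetchToMem mirrors the configuration property of the same name, and isShuffleBlock stands in for checking whether the BlockId is a shuffle block); the numbers in the example are hypothetical, not Spark defaults:

```scala
// Upload as a stream when the block is too large to hold in memory remotely
// or when it is a shuffle block
def shouldUploadAsStream(
    blockSizeBytes: Long,
    isShuffleBlock: Boolean,
    maxRemoteBlockSizeFetchToMem: Long): Boolean =
  blockSizeBytes > maxRemoteBlockSizeFetchToMem || isShuffleBlock

// Example with hypothetical numbers: a 300 MB non-shuffle block and a 200 MB threshold
// shouldUploadAsStream(300L << 20, isShuffleBlock = false, maxRemoteBlockSizeFetchToMem = 200L << 20)
// => true
```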
                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/NettyBlockTransferService/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.network.netty.NettyBlockTransferService logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.network.netty.NettyBlockTransferService=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/OneForOneBlockFetcher/","title":"OneForOneBlockFetcher","text":""},{"location":"storage/OneForOneBlockFetcher/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                    OneForOneBlockFetcher takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                    • TransportClient
                                                                                                                                                                                                                                                                                                                                                                                    • Application ID
                                                                                                                                                                                                                                                                                                                                                                                    • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                    • Block IDs (String[])
                                                                                                                                                                                                                                                                                                                                                                                    • BlockFetchingListener
                                                                                                                                                                                                                                                                                                                                                                                    • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                    • DownloadFileManager

OneForOneBlockFetcher is created (see the sketch after the list below) when:

                                                                                                                                                                                                                                                                                                                                                                                      • NettyBlockTransferService is requested to fetch blocks
                                                                                                                                                                                                                                                                                                                                                                                      • ExternalBlockStoreClient is requested to fetch blocks
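Putting the two lists above together, here is a minimal, hypothetical Scala sketch (not the actual NettyBlockTransferService or ExternalBlockStoreClient code) of how a caller could wire up and start a OneForOneBlockFetcher; the helper name fetchOneForOne is made up for illustration only:

import org.apache.spark.network.client.TransportClient
import org.apache.spark.network.shuffle.{BlockFetchingListener, DownloadFileManager, OneForOneBlockFetcher}
import org.apache.spark.network.util.TransportConf

// Hypothetical helper: creates a OneForOneBlockFetcher with the seven arguments listed above
// and starts it (which sends the BlockTransferMessage over the TransportClient; see Starting).
def fetchOneForOne(
    client: TransportClient,
    appId: String,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener,
    conf: TransportConf,
    fileManager: DownloadFileManager): Unit = {
  val fetcher = new OneForOneBlockFetcher(client, appId, execId, blockIds, listener, conf, fileManager)
  fetcher.start()
}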
                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/OneForOneBlockFetcher/#createfetchshuffleblocksmsg","title":"createFetchShuffleBlocksMsg
                                                                                                                                                                                                                                                                                                                                                                                      FetchShuffleBlocks createFetchShuffleBlocksMsg(\n  String appId,\n  String execId,\n  String[] blockIds)\n

                                                                                                                                                                                                                                                                                                                                                                                      createFetchShuffleBlocksMsg...FIXME

                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/OneForOneBlockFetcher/#starting","title":"Starting
                                                                                                                                                                                                                                                                                                                                                                                      void start()\n

start requests the TransportClient to sendRpc the BlockTransferMessage.

                                                                                                                                                                                                                                                                                                                                                                                      start...FIXME

start is used when:

                                                                                                                                                                                                                                                                                                                                                                                      • ExternalBlockStoreClient is requested to fetchBlocks
                                                                                                                                                                                                                                                                                                                                                                                      • NettyBlockTransferService is requested to fetchBlocks
                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/OneForOneBlockFetcher/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.network.shuffle.OneForOneBlockFetcher logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.network.shuffle.OneForOneBlockFetcher=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/RDDInfo/","title":"RDDInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                      RDDInfo is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/RandomBlockReplicationPolicy/","title":"RandomBlockReplicationPolicy","text":"

                                                                                                                                                                                                                                                                                                                                                                                      RandomBlockReplicationPolicy is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/ShuffleBlockFetcherIterator/","title":"ShuffleBlockFetcherIterator","text":"

                                                                                                                                                                                                                                                                                                                                                                                      ShuffleBlockFetcherIterator is an Iterator[(BlockId, InputStream)] (Scala) that fetches shuffle blocks from local or remote BlockManagers (and makes them available as an InputStream).

                                                                                                                                                                                                                                                                                                                                                                                      ShuffleBlockFetcherIterator allows for a synchronous iteration over shuffle blocks so a caller can handle them in a pipelined fashion as they are received.

                                                                                                                                                                                                                                                                                                                                                                                      ShuffleBlockFetcherIterator is exhausted (and can provide no elements) when the number of blocks already processed is at least the total number of blocks to fetch.

                                                                                                                                                                                                                                                                                                                                                                                      ShuffleBlockFetcherIterator throttles the remote fetches to avoid consuming too much memory.
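For illustration only, the sketch below (not part of Spark) shows how a caller such as BlockStoreShuffleReader can consume the iterator in that pipelined fashion; consumeShuffleBlocks is a made-up helper name:

import java.io.InputStream
import org.apache.spark.storage.BlockId

// Made-up helper for illustration: iterate over fetched shuffle blocks as they arrive.
def consumeShuffleBlocks(iter: Iterator[(BlockId, InputStream)]): Unit = {
  while (iter.hasNext) {
    val (blockId, in) = iter.next()  // blocks until the next fetch result is available
    try {
      // deserialize and process the records of `blockId` here
    } finally {
      in.close()                     // closing the stream releases the underlying buffer
    }
  }
}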

                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/ShuffleBlockFetcherIterator/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                      ShuffleBlockFetcherIterator takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                      • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                      • BlockStoreClient
                                                                                                                                                                                                                                                                                                                                                                                      • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                      • Blocks to Fetch by Address (Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])])
                                                                                                                                                                                                                                                                                                                                                                                      • Stream Wrapper Function ((BlockId, InputStream) => InputStream)
                                                                                                                                                                                                                                                                                                                                                                                      • spark.reducer.maxSizeInFlight
                                                                                                                                                                                                                                                                                                                                                                                      • spark.reducer.maxReqsInFlight
                                                                                                                                                                                                                                                                                                                                                                                      • spark.reducer.maxBlocksInFlightPerAddress
                                                                                                                                                                                                                                                                                                                                                                                      • spark.network.maxRemoteBlockSizeFetchToMem
                                                                                                                                                                                                                                                                                                                                                                                      • spark.shuffle.detectCorrupt
                                                                                                                                                                                                                                                                                                                                                                                      • spark.shuffle.detectCorrupt.useExtraMemory
                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleReadMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                      • doBatchFetch flag

                                                                                                                                                                                                                                                                                                                                                                                        While being created, ShuffleBlockFetcherIterator initializes itself.

ShuffleBlockFetcherIterator is created when:

                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined key-value records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/ShuffleBlockFetcherIterator/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                        initialize(): Unit\n

initialize registers a task cleanup and fetches shuffle blocks from remote and local BlockManagers.

                                                                                                                                                                                                                                                                                                                                                                                        Internally, initialize uses the TaskContext to register the ShuffleFetchCompletionListener (to cleanup).

initialize then partitionBlocksByFetchMode (to split the blocks to fetch into local and remote fetch requests).

                                                                                                                                                                                                                                                                                                                                                                                        initialize...FIXME
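As a rough, hedged sketch of the steps described so far (the real initialize is private and works against the iterator's internal state; the parameters below are stand-ins for those private members):

import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// Simplified stand-in for the private initialize method; not the actual Spark source.
def initializeSketch(
    context: TaskContext,
    completionListener: TaskCompletionListener,
    partitionBlocksByFetchMode: () => Seq[AnyRef]): Unit = {
  // register the completion listener so fetched buffers are cleaned up when the task completes
  context.addTaskCompletionListener(completionListener)
  // split the blocks to fetch into local and remote fetch requests
  val remoteRequests = partitionBlocksByFetchMode()
  // (elided) enqueue the remote requests and start fetching (see fetchUpToMaxBytes)
}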

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#partitionblocksbyfetchmode","title":"partitionBlocksByFetchMode
                                                                                                                                                                                                                                                                                                                                                                                        partitionBlocksByFetchMode(): ArrayBuffer[FetchRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                        partitionBlocksByFetchMode...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#collectfetchrequests","title":"collectFetchRequests
                                                                                                                                                                                                                                                                                                                                                                                        collectFetchRequests(\n  address: BlockManagerId,\n  blockInfos: Seq[(BlockId, Long, Int)],\n  collectedRemoteRequests: ArrayBuffer[FetchRequest]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                        collectFetchRequests...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#createfetchrequests","title":"createFetchRequests
                                                                                                                                                                                                                                                                                                                                                                                        createFetchRequests(\n  curBlocks: Seq[FetchBlockInfo],\n  address: BlockManagerId,\n  isLast: Boolean,\n  collectedRemoteRequests: ArrayBuffer[FetchRequest]): Seq[FetchBlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                        createFetchRequests...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#fetchuptomaxbytes","title":"fetchUpToMaxBytes
                                                                                                                                                                                                                                                                                                                                                                                        fetchUpToMaxBytes(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                        fetchUpToMaxBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        fetchUpToMaxBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator is requested to initialize and next
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#sending-remote-shuffle-block-fetch-request","title":"Sending Remote Shuffle Block Fetch Request
                                                                                                                                                                                                                                                                                                                                                                                        sendRequest(\n  req: FetchRequest): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                        sendRequest prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                        Sending request for [n] blocks ([size]) from [hostPort]\n

sendRequest adds the size of the blocks in the FetchRequest to the bytesInFlight and increments the reqsInFlight internal counters.

sendRequest requests the ShuffleClient to fetch the blocks with a new BlockFetchingListener (and passes in this ShuffleBlockFetcherIterator when the size of the blocks in the FetchRequest is greater than maxReqSizeShuffleToMem, so that the blocks are downloaded to disk rather than kept in memory). A simplified sketch follows the list below.

                                                                                                                                                                                                                                                                                                                                                                                        sendRequest is used when:

                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator is requested to fetch remote shuffle blocks
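A hedged, simplified sketch of this flow (the real sendRequest is private and updates the iterator's internal counters; the parameter names below are placeholders, and the listener is the BlockFetchingListener described next):

import org.apache.spark.network.shuffle.{BlockFetchingListener, BlockStoreClient, DownloadFileManager}
import org.apache.spark.storage.BlockManagerId

// Simplified stand-in for the private sendRequest method; not the actual Spark source.
def sendRequestSketch(
    shuffleClient: BlockStoreClient,
    address: BlockManagerId,
    blockIds: Array[String],
    requestSize: Long,
    maxReqSizeShuffleToMem: Long,
    listener: BlockFetchingListener,
    fetchToDisk: DownloadFileManager): Unit = {
  // the real code also does: bytesInFlight += requestSize; reqsInFlight += 1
  // download to disk (via the DownloadFileManager) only when the request is too large for memory
  val fileManager = if (requestSize > maxReqSizeShuffleToMem) fetchToDisk else null
  shuffleClient.fetchBlocks(
    address.host, address.port, address.executorId, blockIds, listener, fileManager)
}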
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#blockfetchinglistener","title":"BlockFetchingListener

                                                                                                                                                                                                                                                                                                                                                                                        sendRequest creates a new BlockFetchingListener to be notified about successes or failures of shuffle block fetch requests.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#onblockfetchsuccess","title":"onBlockFetchSuccess

In onBlockFetchSuccess, the BlockFetchingListener adds a SuccessFetchResult to the results registry and prints out the following DEBUG message to the logs (unless the iterator is a zombie):

                                                                                                                                                                                                                                                                                                                                                                                        remainingBlocks: [remainingBlocks]\n

                                                                                                                                                                                                                                                                                                                                                                                        In the end, onBlockFetchSuccess prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                        Got remote block [blockId] after [time]\n
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#onblockfetchfailure","title":"onBlockFetchFailure

In onBlockFetchFailure, the BlockFetchingListener adds a FailureFetchResult to the results registry and prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                        Failed to get block(s) from [host]:[port]\n
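The following self-contained sketch illustrates the producer side described above: a BlockFetchingListener that enqueues results into a blocking queue. The SuccessFetchResult and FailureFetchResult case classes are simplified stand-ins for the private classes of ShuffleBlockFetcherIterator (whose actual fields differ):

import java.util.concurrent.LinkedBlockingQueue
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.BlockFetchingListener

// Simplified stand-ins for the private FetchResult classes.
sealed trait FetchResult
case class SuccessFetchResult(blockId: String, buf: ManagedBuffer) extends FetchResult
case class FailureFetchResult(blockId: String, e: Throwable) extends FetchResult

val results = new LinkedBlockingQueue[FetchResult]()

val listener = new BlockFetchingListener {
  override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
    buf.retain()                                   // keep the buffer alive until it is consumed
    results.put(SuccessFetchResult(blockId, buf))  // hand the block over to the iterator
  }
  override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
    results.put(FailureFetchResult(blockId, e))    // surface the failure to the iterator
  }
}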
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#fetchresults","title":"FetchResults
                                                                                                                                                                                                                                                                                                                                                                                        results: LinkedBlockingQueue[FetchResult]\n

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleBlockFetcherIterator uses an internal FIFO blocking queue (Java) of FetchResults.

                                                                                                                                                                                                                                                                                                                                                                                        results is used for fetching the next element.

                                                                                                                                                                                                                                                                                                                                                                                        For remote blocks, FetchResults are added in sendRequest:

                                                                                                                                                                                                                                                                                                                                                                                        • SuccessFetchResults after a BlockFetchingListener is notified about onBlockFetchSuccess
                                                                                                                                                                                                                                                                                                                                                                                        • FailureFetchResults after a BlockFetchingListener is notified about onBlockFetchFailure

                                                                                                                                                                                                                                                                                                                                                                                        For local blocks, FetchResults are added in fetchLocalBlocks:

                                                                                                                                                                                                                                                                                                                                                                                        • SuccessFetchResults after the BlockManager has successfully getLocalBlockData
                                                                                                                                                                                                                                                                                                                                                                                        • FailureFetchResults otherwise

For host-local blocks, FetchResults are added in fetchHostLocalBlock:

                                                                                                                                                                                                                                                                                                                                                                                        • SuccessFetchResults after the BlockManager has successfully getHostLocalShuffleData
                                                                                                                                                                                                                                                                                                                                                                                        • FailureFetchResults otherwise

                                                                                                                                                                                                                                                                                                                                                                                        FailureFetchResults can also be added in fetchHostLocalBlocks.

The results queue is cleaned up in cleanup.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#hasnext","title":"hasNext
                                                                                                                                                                                                                                                                                                                                                                                        hasNext: Boolean\n

hasNext is part of the Iterator (Scala) abstraction (to test whether this iterator can provide another element).

                                                                                                                                                                                                                                                                                                                                                                                        hasNext is true when numBlocksProcessed is below numBlocksToFetch.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#retrieving-next-element","title":"Retrieving Next Element
                                                                                                                                                                                                                                                                                                                                                                                        next(): (BlockId, InputStream)\n

                                                                                                                                                                                                                                                                                                                                                                                        next increments the numBlocksProcessed registry.

                                                                                                                                                                                                                                                                                                                                                                                        next takes (and removes) the head of the results queue.

                                                                                                                                                                                                                                                                                                                                                                                        next requests the ShuffleReadMetricsReporter to incFetchWaitTime.

                                                                                                                                                                                                                                                                                                                                                                                        next...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        next throws a NoSuchElementException if there is no element left.

                                                                                                                                                                                                                                                                                                                                                                                        next is part of the Iterator (Scala) abstraction (to produce the next element of this iterator).
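Reusing the simplified FetchResult stand-ins and the results queue from the sketch in the BlockFetchingListener section above, the consumer side can be pictured as follows (corruption detection, retries, stream wrapping and the Spark-specific fetch-failure exception are omitted; the real next() returns (BlockId, InputStream)):

import java.io.InputStream
import java.util.concurrent.TimeUnit

var numBlocksProcessed = 0
val numBlocksToFetch = 10  // assumption: the real value is computed by partitionBlocksByFetchMode

def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch

def next(): (String, InputStream) = {
  if (!hasNext) throw new NoSuchElementException("all shuffle blocks have been processed")
  numBlocksProcessed += 1

  val startNs = System.nanoTime()
  val result = results.take()  // blocks until the listener enqueues the next FetchResult
  val fetchWaitMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs)
  // the real code reports fetchWaitMs via ShuffleReadMetricsReporter.incFetchWaitTime

  result match {
    case SuccessFetchResult(blockId, buf) => (blockId, buf.createInputStream())
    case FailureFetchResult(blockId, e)   => throw new RuntimeException(s"Failed to fetch $blockId", e)
  }
}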

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#numblocksprocessed","title":"numBlocksProcessed

                                                                                                                                                                                                                                                                                                                                                                                        The number of blocks fetched and consumed

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#numblockstofetch","title":"numBlocksToFetch

                                                                                                                                                                                                                                                                                                                                                                                        Total number of blocks to fetch and consume

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleBlockFetcherIterator can produce up to numBlocksToFetch elements.

numBlocksToFetch is increased every time ShuffleBlockFetcherIterator is requested to partitionBlocksByFetchMode, which prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                        Getting [numBlocksToFetch] non-empty blocks out of [totalBlocks] blocks\n
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#releasecurrentresultbuffer","title":"releaseCurrentResultBuffer
                                                                                                                                                                                                                                                                                                                                                                                        releaseCurrentResultBuffer(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                        releaseCurrentResultBuffer...FIXME

                                                                                                                                                                                                                                                                                                                                                                                        releaseCurrentResultBuffer\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator is requested to cleanup
                                                                                                                                                                                                                                                                                                                                                                                        • BufferReleasingInputStream is requested to close
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#shufflefetchcompletionlistener","title":"ShuffleFetchCompletionListener

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleBlockFetcherIterator creates a ShuffleFetchCompletionListener when created.

ShuffleFetchCompletionListener is used in initialize and toCompletionIterator.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#cleaning-up","title":"Cleaning Up
                                                                                                                                                                                                                                                                                                                                                                                        cleanup(): Unit\n

cleanup marks this ShuffleBlockFetcherIterator as a zombie.

                                                                                                                                                                                                                                                                                                                                                                                        cleanup releases the current result buffer.

cleanup iterates over the results internal queue and, for every SuccessFetchResult, increments the remote bytes read and remote blocks fetched shuffle task metrics and releases the managed buffer.
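The loop can be pictured with the following self-contained sketch (the types are hypothetical stand-ins that only mirror the names used above; this is not Spark's actual code):

import java.util.concurrent.LinkedBlockingQueue

// Hypothetical stand-ins for the internal types mentioned in the text.
trait ManagedBuffer { def size: Long; def release(): Unit }
sealed trait FetchResult
case class SuccessFetchResult(isRemote: Boolean, buf: ManagedBuffer) extends FetchResult
case class FailureFetchResult(e: Throwable) extends FetchResult

class ShuffleReadMetricsSketch { var remoteBytesRead = 0L; var remoteBlocksFetched = 0L }

// Drain the results queue, update metrics for successful remote fetches, release buffers.
def cleanupSketch(results: LinkedBlockingQueue[FetchResult], metrics: ShuffleReadMetricsSketch): Unit = {
  val it = results.iterator()
  while (it.hasNext) {
    it.next() match {
      case SuccessFetchResult(isRemote, buf) =>
        if (isRemote) {
          metrics.remoteBytesRead += buf.size
          metrics.remoteBlocksFetched += 1
        }
        buf.release()
      case _ => // failed fetches carry no buffer to release
    }
  }
}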

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#bytesinflight","title":"bytesInFlight

                                                                                                                                                                                                                                                                                                                                                                                        The bytes of fetched remote shuffle blocks in flight

                                                                                                                                                                                                                                                                                                                                                                                        Starts at 0 when ShuffleBlockFetcherIterator is created

Incremented in every sendRequest and decremented in every next.

ShuffleBlockFetcherIterator makes sure that bytesInFlight stays below maxBytesInFlight for every remote shuffle block fetch.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#reqsinflight","title":"reqsInFlight

                                                                                                                                                                                                                                                                                                                                                                                        The number of remote shuffle block fetch requests in flight.

                                                                                                                                                                                                                                                                                                                                                                                        Starts at 0 when ShuffleBlockFetcherIterator is created

Incremented in every sendRequest and decremented in every next.

ShuffleBlockFetcherIterator makes sure that reqsInFlight stays below maxReqsInFlight for every remote shuffle block fetch (a simplified sketch of both in-flight checks follows).
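The following is a minimal, self-contained sketch (not Spark's actual code; the names are only illustrative) of how the two in-flight counters gate sending the next remote fetch request:

// Hedged sketch: a remote fetch request of `requestSize` bytes may only be sent
// while both in-flight limits still hold.
def canSendNextRequest(
    bytesInFlight: Long,
    reqsInFlight: Int,
    requestSize: Long,
    maxBytesInFlight: Long,
    maxReqsInFlight: Int): Boolean = {
  bytesInFlight + requestSize <= maxBytesInFlight &&
    reqsInFlight + 1 <= maxReqsInFlight
}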

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#iszombie","title":"isZombie

                                                                                                                                                                                                                                                                                                                                                                                        Controls whether ShuffleBlockFetcherIterator is still active and records SuccessFetchResults on successful shuffle block fetches.

                                                                                                                                                                                                                                                                                                                                                                                        Starts false when ShuffleBlockFetcherIterator is created

                                                                                                                                                                                                                                                                                                                                                                                        Enabled (true) in cleanup.

                                                                                                                                                                                                                                                                                                                                                                                        When enabled, registerTempFileToClean is a noop.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#downloadfilemanager","title":"DownloadFileManager

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleBlockFetcherIterator is a DownloadFileManager.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#throwfetchfailedexception","title":"throwFetchFailedException
                                                                                                                                                                                                                                                                                                                                                                                        throwFetchFailedException(\n  blockId: BlockId,\n  mapIndex: Int,\n  address: BlockManagerId,\n  e: Throwable,\n  message: Option[String] = None): Nothing\n

throwFetchFailedException uses the given message (if defined) or falls back to the message of the given Throwable.

                                                                                                                                                                                                                                                                                                                                                                                        In the end, throwFetchFailedException throws a FetchFailedException if the BlockId is either a ShuffleBlockId or a ShuffleBlockBatchId. Otherwise, throwFetchFailedException throws a SparkException:

                                                                                                                                                                                                                                                                                                                                                                                        Failed to get block [blockId], which is not a shuffle block\n
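A hedged, self-contained sketch of the branching (the BlockId subclasses and the exception here are simplified stand-ins for Spark's own classes; the real code throws Spark's FetchFailedException and SparkException):

// Hypothetical, simplified stand-ins for Spark's block id hierarchy and exceptions.
sealed trait BlockId
case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int) extends BlockId
case class ShuffleBlockBatchId(shuffleId: Int, mapId: Long, startReduceId: Int, endReduceId: Int) extends BlockId
case class BroadcastBlockId(broadcastId: Long) extends BlockId

class FetchFailedException(message: String, cause: Throwable) extends Exception(message, cause)

def throwFetchFailedException(
    blockId: BlockId,
    e: Throwable,
    message: Option[String] = None): Nothing = {
  val msg = message.getOrElse(e.getMessage)   // prefer the explicit message, if defined
  blockId match {
    case _: ShuffleBlockId | _: ShuffleBlockBatchId =>
      throw new FetchFailedException(msg, e)  // surfaces as a shuffle fetch failure
    case _ =>
      // non-shuffle blocks surface as a generic error (SparkException in the real code)
      throw new RuntimeException(s"Failed to get block $blockId, which is not a shuffle block", e)
  }
}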

                                                                                                                                                                                                                                                                                                                                                                                        throwFetchFailedException\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator is requested to next
                                                                                                                                                                                                                                                                                                                                                                                        • BufferReleasingInputStream is requested to tryOrFetchFailedException
                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.storage.ShuffleBlockFetcherIterator logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.storage.ShuffleBlockFetcherIterator=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ShuffleFetchCompletionListener/","title":"ShuffleFetchCompletionListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleFetchCompletionListener is a TaskCompletionListener (that ShuffleBlockFetcherIterator uses to clean up after the owning task is completed).

                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/ShuffleFetchCompletionListener/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                        ShuffleFetchCompletionListener takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockFetcherIterator

                                                                                                                                                                                                                                                                                                                                                                                          ShuffleFetchCompletionListener is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleBlockFetcherIterator is created
                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/ShuffleFetchCompletionListener/#ontaskcompletion","title":"onTaskCompletion
                                                                                                                                                                                                                                                                                                                                                                                          onTaskCompletion(\n  context: TaskContext): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                          onTaskCompletion\u00a0is part of the TaskCompletionListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                          onTaskCompletion requests the ShuffleBlockFetcherIterator (if available) to cleanup.

                                                                                                                                                                                                                                                                                                                                                                                          In the end, onTaskCompletion nulls out the reference to the ShuffleBlockFetcherIterator (to make it available for garbage collection).
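As an illustration only (hypothetical types, not Spark's TaskCompletionListener API), the pattern looks like this:

// Clean up once, then drop the reference so the underlying object can be garbage-collected.
trait Cleanable { def cleanup(): Unit }

class CompletionListenerSketch(private var data: Cleanable) {
  def onTaskCompletion(): Unit = {
    if (data != null) data.cleanup()
    data = null
  }
}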

                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/ShuffleMetricsSource/","title":"ShuffleMetricsSource","text":"


ShuffleMetricsSource is the metrics source of a BlockManager for shuffle-related metrics.

ShuffleMetricsSource lives on a Spark executor and is registered only when a Spark application runs in a non-local / cluster mode.

Figure: Registering ShuffleMetricsSource with \"executor\" MetricsSystem (image: ShuffleMetricsSource.png)

Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMetricsSource takes the following to be created:

• Source Name
• MetricSet (Dropwizard Metrics)

ShuffleMetricsSource is created when BlockManager is requested for the shuffle metrics source.

Source Name

ShuffleMetricsSource is given a name when created that is one of the following (see the configuration example after this list):

                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockTransfer when spark.shuffle.service.enabled configuration property is off (false)

                                                                                                                                                                                                                                                                                                                                                                                            • ExternalShuffle when spark.shuffle.service.enabled configuration property is on (true)
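For example, the name depends solely on spark.shuffle.service.enabled (a hedged snippet; the application name is made up for illustration):

import org.apache.spark.SparkConf

// With the external shuffle service on, the shuffle metrics source is named ExternalShuffle;
// with it off (the default), the source is named NettyBlockTransfer.
val conf = new SparkConf()
  .setAppName("shuffle-metrics-demo")
  .set("spark.shuffle.service.enabled", "true")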

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/ShuffleMigrationRunnable/","title":"ShuffleMigrationRunnable","text":"

                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMigrationRunnable is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageLevel/","title":"StorageLevel","text":"

StorageLevel is a set of flags that control the storage of an RDD (see the example after the restrictions below):

Flag | Default Value
useDisk | false
useMemory | true
useOffHeap | false
deserialized | false
replication | 1
","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#restrictions","title":"Restrictions","text":"
                                                                                                                                                                                                                                                                                                                                                                                            1. The replication is restricted to be less than 40 (for calculating the hash code)
                                                                                                                                                                                                                                                                                                                                                                                            2. Off-heap storage level does not support deserialized storage
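For illustration, a quick check of the flags of one of the built-in levels (assuming spark-core is on the classpath, e.g. in spark-shell):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK_SER_2: serialized storage in memory and on disk, replicated twice.
val level = StorageLevel.MEMORY_AND_DISK_SER_2
assert(level.useDisk && level.useMemory && !level.useOffHeap)
assert(!level.deserialized)
assert(level.replication == 2)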
                                                                                                                                                                                                                                                                                                                                                                                            ","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#validation","title":"Validation
                                                                                                                                                                                                                                                                                                                                                                                            isValid: Boolean\n

StorageLevel is considered valid when both of the following hold (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                            1. Uses memory or disk
2. Replication is a positive number (between the default of 1 and 40)
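A hedged restatement of the rule as a standalone function (assumed semantics, not necessarily Spark's own isValid):

// Valid only when the level uses memory or disk and replication is positive.
def isValidLevel(useMemory: Boolean, useDisk: Boolean, replication: Int): Boolean =
  (useMemory || useDisk) && replication > 0

assert(isValidLevel(useMemory = true, useDisk = false, replication = 1))
assert(!isValidLevel(useMemory = false, useDisk = false, replication = 1))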
                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#externalizable","title":"Externalizable

StorageLevel is an Externalizable (Java).

                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#writeexternal","title":"writeExternal
                                                                                                                                                                                                                                                                                                                                                                                            writeExternal(\n  out: ObjectOutput): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                            writeExternal\u00a0is part of the Externalizable (Java) abstraction.

writeExternal writes out the bitwise representation followed by the replication of this StorageLevel.
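A hedged illustration of the Externalizable round trip (assuming spark-core on the classpath; plain Java serialization goes through writeExternal and readExternal):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.storage.StorageLevel

val bytes = new ByteArrayOutputStream()
val out = new ObjectOutputStream(bytes)
out.writeObject(StorageLevel.MEMORY_AND_DISK_2)   // writes the bitwise flags and the replication
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
val restored = in.readObject().asInstanceOf[StorageLevel]
assert(restored == StorageLevel.MEMORY_AND_DISK_2)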

                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#bitwise-integer-representation","title":"Bitwise Integer Representation
                                                                                                                                                                                                                                                                                                                                                                                            toInt: Int\n

toInt converts this StorageLevel to its numeric (bitwise) representation by turning on the corresponding bit for each of the following flags (if used, in that order, starting from the least-significant bit):

                                                                                                                                                                                                                                                                                                                                                                                            1. deserialized
                                                                                                                                                                                                                                                                                                                                                                                            2. useOffHeap
                                                                                                                                                                                                                                                                                                                                                                                            3. useMemory
                                                                                                                                                                                                                                                                                                                                                                                            4. useDisk

                                                                                                                                                                                                                                                                                                                                                                                            In other words, the following number in bitwise representation says the StorageLevel is deserialized and useMemory:

                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.storage.StorageLevel.MEMORY_ONLY\nassert(MEMORY_ONLY.toInt == (0 | 1 | 4))\n\nscala> println(MEMORY_ONLY.toInt.toBinaryString)\n101\n
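Following the same bit order, MEMORY_AND_DISK additionally turns on the useDisk bit (assuming the example runs in spark-shell as above):

import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

// useDisk (8) | useMemory (4) | deserialized (1) == 13 == 1101 in binary
assert(MEMORY_AND_DISK.toInt == (8 | 4 | 1))
assert(MEMORY_AND_DISK.toInt.toBinaryString == "1101")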

                                                                                                                                                                                                                                                                                                                                                                                            toInt\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                            • StorageLevel is requested to writeExternal and hashCode
                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageStatus/","title":"StorageStatus","text":"


StorageStatus is a developer API that Spark uses to pass \"just enough\" information about registered BlockManagers in a Spark application between Spark services (mostly for monitoring purposes like web UI or SparkListeners).

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageStatus/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                            There are two ways to access StorageStatus about all the known BlockManagers in a Spark application:

                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext.md#getExecutorStorageStatus[SparkContext.getExecutorStorageStatus]
                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageStatus/#being-a-sparklistenermd-and-intercepting-sparklistenermdonblockmanageraddedonblockmanageradded-and-sparklistenermdonblockmanagerremovedonblockmanagerremoved-events","title":"* Being a SparkListener.md[] and intercepting SparkListener.md#onBlockManagerAdded[onBlockManagerAdded] and SparkListener.md#onBlockManagerRemoved[onBlockManagerRemoved] events","text":"

StorageStatus is created when:

• BlockManagerMasterEndpoint is requested for storage status (of every BlockManager in a Spark application)

StorageStatus's Internal Registries and Counters:

Name | Description
_nonRddBlocks | Lookup table of BlockStatus per BlockId (for non-RDD blocks). Used when...FIXME
_rddBlocks | Lookup table of BlockIds with BlockStatus per RDD id. Used when...FIXME

updateStorageInfo Internal Method

updateStorageInfo(\n  blockId: BlockId,\n  newBlockStatus: BlockStatus): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                            updateStorageInfo...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            NOTE: updateStorageInfo is used when...FIXME

Creating StorageStatus Instance

                                                                                                                                                                                                                                                                                                                                                                                            StorageStatus takes the following when created:

• BlockManagerId
• Maximum memory (total available on-heap and off-heap memory for storage on the BlockManager)

StorageStatus initializes the internal registries.

Getting RDD Blocks For RDD -- rddBlocksById Method

rddBlocksById(\n  rddId: Int): Map[BlockId, BlockStatus]\n

rddBlocksById gives the blocks (as BlockId with their BlockStatus) that belong to the given rddId RDD.

                                                                                                                                                                                                                                                                                                                                                                                            === [[removeBlock]] Removing Block (From Internal Registries) -- removeBlock Internal Method

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageStatus/#source-scala_2","title":"[source, scala]","text":""},{"location":"storage/StorageStatus/#removeblockblockid-blockid-optionblockstatus","title":"removeBlock(blockId: BlockId): Option[BlockStatus]","text":"

removeBlock removes blockId from the <<_rddBlocks, _rddBlocks>> (or <<_nonRddBlocks, _nonRddBlocks>>) registry and returns the removed block's BlockStatus (if any).

Internally, removeBlock updates the block status of blockId (to be empty, i.e. removed).

removeBlock branches off based on the type of the storage:BlockId.md[BlockId], i.e. whether it is an RDDBlockId or not.

For an RDDBlockId, removeBlock finds the RDD in <<_rddBlocks, _rddBlocks>> and removes the blockId. If no blocks remain for that RDD, removeBlock removes the RDD entry from <<_rddBlocks, _rddBlocks>> entirely.

For any other BlockId, removeBlock removes blockId from the <<_nonRddBlocks, _nonRddBlocks>> registry.
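The branching above can be sketched as follows; the registry shapes and the class wrapper are assumptions (the block-status update is omitted), not the actual Spark source.

[source, scala]
----
// Sketch of the removeBlock branching, with assumed registry shapes.
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId}

class BlockRegistry {
  private val _rddBlocks    = mutable.Map.empty[Int, mutable.Map[BlockId, BlockStatus]]
  private val _nonRddBlocks = mutable.Map.empty[BlockId, BlockStatus]

  def removeBlock(blockId: BlockId): Option[BlockStatus] = blockId match {
    case RDDBlockId(rddId, _) =>
      val removed = _rddBlocks.get(rddId).flatMap(_.remove(blockId))
      // Drop the RDD entry entirely once its last block is gone.
      if (_rddBlocks.get(rddId).exists(_.isEmpty)) _rddBlocks.remove(rddId)
      removed
    case _ =>
      _nonRddBlocks.remove(blockId)
  }
}
----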

                                                                                                                                                                                                                                                                                                                                                                                            === [[addBlock]] Registering Status of Data Block -- addBlock Method

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageStatus/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                            addBlock( blockId: BlockId, blockStatus: BlockStatus): Unit

                                                                                                                                                                                                                                                                                                                                                                                            addBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            NOTE: addBlock is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            === [[getBlock]] getBlock Method

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageStatus/#source-scala_4","title":"[source, scala]","text":""},{"location":"storage/StorageStatus/#getblockblockid-blockid-optionblockstatus","title":"getBlock(blockId: BlockId): Option[BlockStatus]","text":"

                                                                                                                                                                                                                                                                                                                                                                                            getBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            NOTE: getBlock is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/StorageUtils/","title":"StorageUtils","text":""},{"location":"storage/StorageUtils/#port-of-external-shuffle-service","title":"Port of External Shuffle Service
                                                                                                                                                                                                                                                                                                                                                                                            externalShuffleServicePort(\n  conf: SparkConf): Int\n

                                                                                                                                                                                                                                                                                                                                                                                            externalShuffleServicePort...FIXME

externalShuffleServicePort is used when:

                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                                            • BlockManagerMasterEndpoint is created
                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/TempFileBasedBlockStoreUpdater/","title":"TempFileBasedBlockStoreUpdater","text":"

                                                                                                                                                                                                                                                                                                                                                                                            TempFileBasedBlockStoreUpdater is a BlockStoreUpdater (that BlockManager uses for storing a block from bytes in a local temporary file).

                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/TempFileBasedBlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                            TempFileBasedBlockStoreUpdater takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                            • BlockId
                                                                                                                                                                                                                                                                                                                                                                                            • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                                            • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                                                            • Temporary File
                                                                                                                                                                                                                                                                                                                                                                                            • Block Size
                                                                                                                                                                                                                                                                                                                                                                                            • tellMaster flag (default: true)
                                                                                                                                                                                                                                                                                                                                                                                            • keepReadLock flag (default: false)

TempFileBasedBlockStoreUpdater is created when:

                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is requested to putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                                                              • PythonBroadcast is requested to readObject
                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/TempFileBasedBlockStoreUpdater/#block-data","title":"Block Data
                                                                                                                                                                                                                                                                                                                                                                                              blockData(): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                                              blockData requests the DiskStore (of the parent BlockManager) to getBytes (with the temp file and the block size).

blockData is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/TempFileBasedBlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
                                                                                                                                                                                                                                                                                                                                                                                              saveToDiskStore(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                              saveToDiskStore requests the DiskStore (of the parent BlockManager) to moveFileToBlock.

saveToDiskStore is part of the BlockStoreUpdater abstraction.
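The two delegations described above can be sketched as follows; the getBytes and moveFileToBlock names come from the prose, but the DiskStoreOps trait, the parameter types, and the wiring are stand-ins for illustration only.

[source, scala]
----
// Sketch of the delegation to DiskStore, with hypothetical stand-in types.
import java.io.File

trait DiskStoreOps {
  def getBytes(file: File, blockSize: Long): Array[Byte]
  def moveFileToBlock(sourceFile: File, blockSize: Long, blockId: String): Unit
}

class TempFileUpdaterSketch(diskStore: DiskStoreOps, tmpFile: File,
                            blockSize: Long, blockId: String) {
  def blockData(): Array[Byte] = diskStore.getBytes(tmpFile, blockSize)
  def saveToDiskStore(): Unit  = diskStore.moveFileToBlock(tmpFile, blockSize, blockId)
}
----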

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/","title":"Spark Tools","text":"

                                                                                                                                                                                                                                                                                                                                                                                              Main abstractions:

                                                                                                                                                                                                                                                                                                                                                                                              • AbstractCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/AbstractCommandBuilder/","title":"AbstractCommandBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                              AbstractCommandBuilder is an abstraction of launch command builders.

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/AbstractCommandBuilder/#contract","title":"Contract","text":""},{"location":"tools/AbstractCommandBuilder/#buildCommand","title":"Building Command","text":"
                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildCommand(\n  Map<String, String> env)\n

Builds a command to launch a script on the command line

                                                                                                                                                                                                                                                                                                                                                                                              See:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkClassCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder

                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                              • Main is requested to build a command
                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/AbstractCommandBuilder/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                              • SparkClassCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                              • WorkerCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/AbstractCommandBuilder/#buildjavacommand","title":"buildJavaCommand
                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildJavaCommand(\n  String extraClassPath)\n

buildJavaCommand builds the Java command for a Spark application (a collection of elements with the path to the java executable, JVM options from the java-opts file, and a class path).

If javaHome is set, buildJavaCommand adds [javaHome]/bin/java to the result Java command. Otherwise, it uses the JAVA_HOME environment variable or, when neither is set, falls back to the java.home Java system property.

                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME Who sets javaHome internal property and when?

buildJavaCommand loads extra Java options from the java-opts file in the configuration directory (if the file exists) and adds them to the result Java command.

                                                                                                                                                                                                                                                                                                                                                                                              Eventually, buildJavaCommand builds the class path (with the extra class path if non-empty) and adds it as -cp to the result Java command.
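The following Scala sketch illustrates the steps above. The real builder is Java; the parameterization (passing the configuration directory and extra class path explicitly) and the simplifications (the internal javaHome field and the full class-path assembly are omitted) are assumptions for illustration.

[source, scala]
----
// Illustrative sketch of the java-command assembly described above.
import java.io.File
import scala.io.Source

def buildJavaCommand(confDir: String, extraClassPath: String): Seq[String] = {
  // 1. Pick the java executable: JAVA_HOME first, then the java.home system property.
  val javaHome = sys.env.getOrElse("JAVA_HOME", sys.props("java.home"))
  val javaExe  = Seq(javaHome, "bin", "java").mkString(File.separator)

  // 2. Extra JVM options from [confDir]/java-opts, if the file exists.
  val javaOptsFile = new File(confDir, "java-opts")
  val javaOpts =
    if (javaOptsFile.isFile)
      Source.fromFile(javaOptsFile).getLines()
        .flatMap(_.trim.split("\\s+")).filter(_.nonEmpty).toSeq
    else Seq.empty[String]

  // 3. Class path added as -cp (only the extra class path in this sketch).
  Seq(javaExe) ++ javaOpts ++ Seq("-cp", extraClassPath)
}
----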

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractCommandBuilder/#buildclasspath","title":"buildClassPath
                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildClassPath(\n  String appClassPath)\n

                                                                                                                                                                                                                                                                                                                                                                                              buildClassPath builds the classpath for a Spark application.

NOTE: Directories always end up with the OS-specific file separator at the end of their paths.

buildClassPath adds the following entries, in this order (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                              1. SPARK_CLASSPATH environment variable
                                                                                                                                                                                                                                                                                                                                                                                              2. The input appClassPath
                                                                                                                                                                                                                                                                                                                                                                                              3. The configuration directory
4. (only with SPARK_PREPEND_CLASSES set or SPARK_TESTING being 1) Locally compiled Spark classes in classes, test-classes and Core's jars.
+
CAUTION: FIXME Elaborate on "locally compiled Spark classes".

5. (only with SPARK_SQL_TESTING being 1) ...
+
CAUTION: FIXME Elaborate on the SQL testing case

                                                                                                                                                                                                                                                                                                                                                                                              6. HADOOP_CONF_DIR environment variable

                                                                                                                                                                                                                                                                                                                                                                                              7. YARN_CONF_DIR environment variable

                                                                                                                                                                                                                                                                                                                                                                                              8. SPARK_DIST_CLASSPATH environment variable

NOTE: childEnv is queried before the System properties. It is always empty for AbstractCommandBuilder (and SparkSubmitCommandBuilder, too).
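The ordering can be sketched in Scala as follows (the real builder is Java); the testing-only branches 4 and 5 are omitted, and the explicit parameters are assumptions for illustration.

[source, scala]
----
// Sketch of the class-path ordering described above (entries 1-3 and 6-8).
def buildClassPath(appClassPath: String, confDir: String): Seq[String] = {
  def env(name: String): Option[String] = sys.env.get(name).filter(_.nonEmpty)

  Seq(
    env("SPARK_CLASSPATH"),        // 1. SPARK_CLASSPATH environment variable
    Option(appClassPath),          // 2. the input appClassPath
    Option(confDir),               // 3. the configuration directory
    env("HADOOP_CONF_DIR"),        // 6.
    env("YARN_CONF_DIR"),          // 7.
    env("SPARK_DIST_CLASSPATH")    // 8.
  ).flatten.filter(_.nonEmpty)
}
----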

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractCommandBuilder/#loading-properties-file","title":"Loading Properties File
                                                                                                                                                                                                                                                                                                                                                                                              Properties loadPropertiesFile()\n

                                                                                                                                                                                                                                                                                                                                                                                              loadPropertiesFile loads Spark settings from a properties file (when specified on the command line) or spark-defaults.conf in the configuration directory.

loadPropertiesFile checks the following locations in order and loads the settings from the first properties file found (see the sketch after the list):

1. propertiesFile (if specified using the --properties-file command-line option or set by AbstractCommandBuilder.setPropertiesFile).
                                                                                                                                                                                                                                                                                                                                                                                              2. [SPARK_CONF_DIR]/spark-defaults.conf
                                                                                                                                                                                                                                                                                                                                                                                              3. [SPARK_HOME]/conf/spark-defaults.conf
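A Scala sketch of the lookup order above (the real builder is Java and also reads the file with a UTF-8 reader); the explicit parameters are assumptions for illustration.

[source, scala]
----
// Sketch of the spark-defaults.conf resolution order described above.
import java.io.{File, FileInputStream}
import java.util.Properties

def loadPropertiesFile(propertiesFile: Option[File],
                       confDir: String,
                       sparkHome: String): Properties = {
  val candidates = propertiesFile.toSeq ++ Seq(
    new File(confDir, "spark-defaults.conf"),
    new File(sparkHome, "conf/spark-defaults.conf"))
  val props = new Properties()
  // Load from the first location that actually exists.
  candidates.find(_.isFile).foreach { f =>
    val in = new FileInputStream(f)
    try props.load(in) finally in.close()
  }
  props
}
----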
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractCommandBuilder/#sparks-configuration-directory","title":"Spark's Configuration Directory

                                                                                                                                                                                                                                                                                                                                                                                              AbstractCommandBuilder uses getConfDir to compute the current configuration directory of a Spark application.

It uses SPARK_CONF_DIR (from childEnv, which is always empty anyway, or as an environment variable) and falls back to [SPARK_HOME]/conf (with SPARK_HOME from getSparkHome).
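A minimal Scala sketch of the fallback (the childEnv lookup is omitted; getSparkHome is covered in the next section):

[source, scala]
----
// SPARK_CONF_DIR, else [SPARK_HOME]/conf.
def getConfDir(getSparkHome: () => String): String =
  sys.env.getOrElse("SPARK_CONF_DIR", s"${getSparkHome()}/conf")
----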

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractCommandBuilder/#sparks-home-directory","title":"Spark's Home Directory

                                                                                                                                                                                                                                                                                                                                                                                              AbstractCommandBuilder uses getSparkHome to compute Spark's home directory for a Spark application.

It uses SPARK_HOME (from childEnv, which is always empty anyway, or as an environment variable).

If SPARK_HOME is not set, Spark throws an IllegalStateException:

----
Spark home not found; set it explicitly or use the SPARK_HOME environment variable.
----
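A minimal Scala sketch of this behaviour (the real builder is Java and also consults its childEnv map before the environment):

[source, scala]
----
// SPARK_HOME from the environment, else fail fast.
def getSparkHome(): String =
  sys.env.getOrElse("SPARK_HOME",
    throw new IllegalStateException(
      "Spark home not found; set it explicitly or use the SPARK_HOME environment variable."))
----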
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractCommandBuilder/#appResource","title":"Application Resource
                                                                                                                                                                                                                                                                                                                                                                                              String appResource\n

AbstractCommandBuilder uses the appResource variable for the name of an application resource.

                                                                                                                                                                                                                                                                                                                                                                                              appResource can be one of the following application resource names:

|===
| Identifier | appResource

| pyspark-shell-main | pyspark-shell-main
| sparkr-shell-main | sparkr-shell-main
| run-example | findExamplesAppJar
| pyspark-shell | buildPySparkShellCommand
| sparkr-shell | buildSparkRCommand
|===

                                                                                                                                                                                                                                                                                                                                                                                              appResource can be specified when:

                                                                                                                                                                                                                                                                                                                                                                                              • AbstractLauncher is requested to setAppResource
                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder is created
                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder.OptionParser is requested to handle known or unknown options

                                                                                                                                                                                                                                                                                                                                                                                              appResource is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkLauncher is requested to startApplication
• SparkSubmitCommandBuilder is requested to build a command or buildSparkSubmitArgs
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/AbstractLauncher/","title":"AbstractLauncher","text":"

                                                                                                                                                                                                                                                                                                                                                                                              AbstractLauncher is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/DependencyUtils/","title":"DependencyUtils Utilities","text":""},{"location":"tools/DependencyUtils/#resolveglobpaths","title":"resolveGlobPaths
                                                                                                                                                                                                                                                                                                                                                                                              resolveGlobPaths(\n  paths: String,\n  hadoopConf: Configuration): String\n

                                                                                                                                                                                                                                                                                                                                                                                              resolveGlobPaths...FIXME

                                                                                                                                                                                                                                                                                                                                                                                              resolveGlobPaths is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                              • DependencyUtils is used to resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/DependencyUtils/#downloadfile","title":"downloadFile
                                                                                                                                                                                                                                                                                                                                                                                              downloadFile(\n  path: String,\n  targetDir: File,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                              downloadFile resolves the path to a well-formed URI and branches off based on the scheme:

                                                                                                                                                                                                                                                                                                                                                                                              • For file and local schemes, downloadFile returns the input path
                                                                                                                                                                                                                                                                                                                                                                                              • For other schemes, downloadFile...FIXME
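The scheme dispatch can be sketched as follows; since the non-local branch is not described here, the actual download step is abstracted behind a hypothetical download function.

[source, scala]
----
// Sketch of the scheme-based branching described above.
import java.net.URI

def downloadFile(path: String, download: URI => String): String = {
  val uri = new URI(path)
  uri.getScheme match {
    case null | "file" | "local" => path          // already available locally
    case _                       => download(uri) // e.g. fetch into the target directory
  }
}
----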

                                                                                                                                                                                                                                                                                                                                                                                              downloadFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                              • DependencyUtils is used to downloadFileList
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/DependencyUtils/#downloadfilelist","title":"downloadFileList
                                                                                                                                                                                                                                                                                                                                                                                              downloadFileList(\n  fileList: String,\n  targetDir: File,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                              downloadFileList...FIXME
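
Until the details are documented, a reasonable mental model (an assumption, not taken from the page above) is that downloadFileList splits the comma-separated fileList and delegates each entry to downloadFile:

// Sketch only: assumes per-entry delegation to downloadFile.
def downloadFileListSketch(fileList: String)(downloadOne: String => String): String =
  fileList
    .split(\",\")
    .filter(_.nonEmpty)    // skip empty entries
    .map(downloadOne)      // download (or pass through) each entry
    .mkString(\",\")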

                                                                                                                                                                                                                                                                                                                                                                                              downloadFileList is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                              • DependencyUtils is used to resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/DependencyUtils/#resolvemavendependencies","title":"resolveMavenDependencies
                                                                                                                                                                                                                                                                                                                                                                                              resolveMavenDependencies(\n  packagesExclusions: String,\n  packages: String,\n  repositories: String,\n  ivyRepoPath: String,\n  ivySettingsPath: Option[String]): String\n

                                                                                                                                                                                                                                                                                                                                                                                              resolveMavenDependencies...FIXME
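
While the resolution itself is not described here, the shape of the inputs can be illustrated; the coordinates and repository below are hypothetical:

// Hypothetical inputs (illustration only).
val packages = \"org.apache.kafka:kafka-clients:3.6.0,org.postgresql:postgresql:42.7.1\"
val packagesExclusions = \"org.slf4j:slf4j-api\"
val repositories = \"https://repo1.maven.org/maven2\"
// The result is expected to be a comma-separated list of resolved local jar paths.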

                                                                                                                                                                                                                                                                                                                                                                                              resolveMavenDependencies is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to prepareSubmitEnvironment (for all resource managers but Spark Standalone and Apache Mesos)
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/DependencyUtils/#adding-local-jars-to-classloader","title":"Adding Local Jars to ClassLoader
                                                                                                                                                                                                                                                                                                                                                                                              addJarToClasspath(\n  localJar: String,\n  loader: MutableURLClassLoader): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                              addJarToClasspath adds file and local jars (as localJar) to the loader classloader.

addJarToClasspath resolves the URI of localJar. If the URI scheme is file or local and the file denoted by localJar exists, localJar is added to loader. Otherwise, the following warning is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                              Warning: Local jar /path/to/fake.jar does not exist, skipping.\n

                                                                                                                                                                                                                                                                                                                                                                                              For all other URIs, the following warning is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                              Warning: Skip remote jar hdfs://fake.jar.\n

                                                                                                                                                                                                                                                                                                                                                                                              Note

addJarToClasspath assumes a file URI when localJar specifies no URI scheme, e.g. /path/to/local.jar.
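
A condensed sketch of that decision flow (illustrative only; addUrl stands in for adding a URL to the MutableURLClassLoader):

import java.io.File
import java.net.{URI, URL}

// Sketch of addJarToClasspath (not the actual implementation).
def addJarToClasspathSketch(localJar: String)(addUrl: URL => Unit): Unit = {
  val uri = new URI(localJar)
  Option(uri.getScheme).getOrElse(\"file\") match {
    case \"file\" | \"local\" =>
      val file = new File(Option(uri.getPath).getOrElse(localJar))
      if (file.exists) addUrl(file.toURI.toURL)
      else System.err.println(s\"Warning: Local jar $file does not exist, skipping.\")
    case _ =>
      System.err.println(s\"Warning: Skip remote jar $uri.\")
  }
}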

                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/DependencyUtils/#resolveanddownloadjars","title":"resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                              resolveAndDownloadJars(\n  jars: String,\n  userJar: String,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                              resolveAndDownloadJars...FIXME

                                                                                                                                                                                                                                                                                                                                                                                              resolveAndDownloadJars is used when:

                                                                                                                                                                                                                                                                                                                                                                                              • DriverWrapper is requested to setupDependencies (Spark Standalone cluster mode)
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/JavaMainApplication/","title":"JavaMainApplication","text":"

                                                                                                                                                                                                                                                                                                                                                                                              JavaMainApplication is...FIXME
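
Until this page is filled in, a rough sketch may help: JavaMainApplication is commonly described as the SparkApplication adapter that invokes the standard main method of a given class via reflection. The sketch below is an assumption for illustration, not the actual class:

import java.lang.reflect.Modifier
import org.apache.spark.SparkConf

// Illustrative sketch: run a plain Java/Scala main class reflectively.
class JavaMainApplicationSketch(klass: Class[_]) {
  def start(args: Array[String], conf: SparkConf): Unit = {
    val mainMethod = klass.getMethod(\"main\", classOf[Array[String]])
    require(Modifier.isStatic(mainMethod.getModifiers), \"main must be static\")
    mainMethod.invoke(null, args)  // invoke main(String[]) with the user arguments
  }
}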

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/Main/","title":"Main","text":"

                                                                                                                                                                                                                                                                                                                                                                                              Main\u00a0is the standalone application that is launched from spark-class shell script.

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/Main/#main","title":"Launching Application","text":"
                                                                                                                                                                                                                                                                                                                                                                                              void main(\n  String[] argsArray)\n

                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                              main requires that at least the class name (className) is given as the first argument in the given argsArray.

For the org.apache.spark.deploy.SparkSubmit class name, main creates a SparkSubmitCommandBuilder and builds a command with it.

Otherwise, main creates a SparkClassCommandBuilder and builds a command with it.

Class Name and the corresponding AbstractCommandBuilder:

• org.apache.spark.deploy.SparkSubmit: SparkSubmitCommandBuilder
• anything else: SparkClassCommandBuilder

In the end, main uses prepareWindowsCommand or prepareBashCommand to prepare the final command, depending on the operating system it runs on (MS Windows or non-Windows, respectively).
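
The selection can be sketched as follows; this is an illustrative Scala paraphrase of the Java launcher logic, not the actual implementation:

// Illustrative paraphrase (in Scala) of how Main picks a command builder.
sealed trait CommandBuilderSketch
case class SparkSubmitBuilderSketch(args: List[String]) extends CommandBuilderSketch
case class SparkClassBuilderSketch(className: String, args: List[String]) extends CommandBuilderSketch

def chooseBuilder(className: String, classArgs: List[String]): CommandBuilderSketch =
  className match {
    case \"org.apache.spark.deploy.SparkSubmit\" => SparkSubmitBuilderSketch(classArgs)
    case _ => SparkClassBuilderSketch(className, classArgs)
  }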

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/Main/#buildCommand","title":"Building Command","text":"
                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildCommand(\n  AbstractCommandBuilder builder,\n  Map<String, String> env,\n  boolean printLaunchCommand)\n

                                                                                                                                                                                                                                                                                                                                                                                              buildCommand requests the given AbstractCommandBuilder to build a command.

                                                                                                                                                                                                                                                                                                                                                                                              With printLaunchCommand enabled, buildCommand prints out the command to standard error:

                                                                                                                                                                                                                                                                                                                                                                                              Spark Command: [cmd]\n========================================\n

                                                                                                                                                                                                                                                                                                                                                                                              SPARK_PRINT_LAUNCH_COMMAND

                                                                                                                                                                                                                                                                                                                                                                                              printLaunchCommand is controlled by SPARK_PRINT_LAUNCH_COMMAND environment variable.
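
The flow above can be summarized in a short sketch (illustrative only, with the builder abstracted away as a function):

// Illustrative sketch of buildCommand: build the command, optionally print it, return it.
def buildCommandSketch(build: () => List[String]): List[String] = {
  val printLaunchCommand = sys.env.get(\"SPARK_PRINT_LAUNCH_COMMAND\").exists(_.nonEmpty)
  val cmd = build()
  if (printLaunchCommand) {
    System.err.println(s\"Spark Command: ${cmd.mkString(\" \")}\")
    System.err.println(\"=\" * 40)
  }
  cmd
}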

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/SparkApplication/","title":"SparkApplication","text":"

                                                                                                                                                                                                                                                                                                                                                                                              SparkApplication is an abstraction of entry points to Spark applications that can be started (submitted for execution using spark-submit).

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/SparkApplication/#contract","title":"Contract","text":""},{"location":"tools/SparkApplication/#starting-spark-application","title":"Starting Spark Application
                                                                                                                                                                                                                                                                                                                                                                                              start(\n  args: Array[String], conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to submit an application for execution
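
To make the contract concrete, here is an illustrative sketch of an implementation (the actual trait lives in org.apache.spark.deploy and is internal to Spark, so the trait below is redeclared only for the example):

import org.apache.spark.SparkConf

// Illustrative only: the shape of the SparkApplication contract and an implementation.
trait SparkApplicationSketch {
  def start(args: Array[String], conf: SparkConf): Unit
}

class HelloApplication extends SparkApplicationSketch {
  override def start(args: Array[String], conf: SparkConf): Unit = {
    // A real implementation would create a SparkContext or SparkSession here.
    println(s\"Starting with ${args.length} argument(s) as ${conf.get(\"spark.app.name\", \"unknown\")}\")
  }
}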
                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/SparkApplication/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                              • ClientApp
                                                                                                                                                                                                                                                                                                                                                                                              • JavaMainApplication
                                                                                                                                                                                                                                                                                                                                                                                              • KubernetesClientApplication (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                              • RestSubmissionClientApp
                                                                                                                                                                                                                                                                                                                                                                                              • YarnClusterApplication
                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/SparkClassCommandBuilder/","title":"SparkClassCommandBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                              SparkClassCommandBuilder is an AbstractCommandBuilder.

                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/SparkClassCommandBuilder/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                              SparkClassCommandBuilder takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                              • Class Name
                                                                                                                                                                                                                                                                                                                                                                                              • Class Arguments (List<String>)

                                                                                                                                                                                                                                                                                                                                                                                                SparkClassCommandBuilder is created when:

                                                                                                                                                                                                                                                                                                                                                                                                • Main standalone application is launched
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/SparkLauncher/","title":"SparkLauncher","text":"

SparkLauncher is an interface to launch Spark applications programmatically, i.e. from code (rather than using spark-submit directly). It uses a builder pattern to configure a Spark application and launch it as a child process using spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                SparkLauncher uses SparkSubmitCommandBuilder to build the Spark command of a Spark application to launch.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/SparkLauncher/#spark-internal","title":"spark-internal

                                                                                                                                                                                                                                                                                                                                                                                                SparkLauncher defines spark-internal (NO_RESOURCE) as a special value to inform Spark not to try to process the application resource (primary resource) as a regular file (but as an imaginary resource that cluster managers would know how to look up and submit for execution, e.g. Spark on YARN or Spark on Kubernetes).

                                                                                                                                                                                                                                                                                                                                                                                                spark-internal special value is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to prepareSubmitEnvironment and checks whether to add the primaryResource as part of the following:
                                                                                                                                                                                                                                                                                                                                                                                                • --jar (for Spark on YARN in cluster deploy mode)
                                                                                                                                                                                                                                                                                                                                                                                                • --primary-* arguments and define the --main-class argument (for Spark on Kubernetes in cluster deploy mode with KubernetesClientApplication main class)
                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to check whether a resource is internal or not
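
For illustration, a launcher-based client can use the special value described above programmatically through SparkLauncher.NO_RESOURCE (the master URL and main class below are made up for the example):

import org.apache.spark.launcher.SparkLauncher

// Illustrative only: no primary resource; the cluster manager resolves the application.
val handle = new SparkLauncher()
  .setMaster(\"yarn\")                           // hypothetical master
  .setDeployMode(\"cluster\")
  .setMainClass(\"com.example.MyApp\")           // hypothetical main class
  .setAppResource(SparkLauncher.NO_RESOURCE)   // spark-internal
  .startApplication()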
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/SparkLauncher/#other","title":"Other

SparkLauncher's builder methods to set up the invocation of a Spark application:

• addAppArgs(String... args): Adds command-line arguments for a Spark application.
• addFile(String file): Adds a file to be submitted with a Spark application.
• addJar(String jar): Adds a jar file to be submitted with the application.
• addPyFile(String file): Adds a Python file / zip / egg to be submitted with a Spark application.
• addSparkArg(String arg): Adds a no-value argument to the Spark invocation.
• addSparkArg(String name, String value): Adds an argument with a value to the Spark invocation. It recognizes known command-line arguments, i.e. --master, --properties-file, --conf, --class, --jars, --files, and --py-files.
• directory(File dir): Sets the working directory of spark-submit.
• redirectError(): Redirects stderr to stdout.
• redirectError(File errFile): Redirects error output to the specified errFile file.
• redirectError(ProcessBuilder.Redirect to): Redirects error output to the specified to Redirect.
• redirectOutput(File outFile): Redirects output to the specified outFile file.
• redirectOutput(ProcessBuilder.Redirect to): Redirects standard output to the specified to Redirect.
• redirectToLog(String loggerName): Sets all output to be logged and redirected to a logger with the specified name.
• setAppName(String appName): Sets the name of a Spark application.
• setAppResource(String resource): Sets the main application resource, i.e. the location of a jar file for Scala/Java applications.
• setConf(String key, String value): Sets a Spark property. Expects key to start with the spark. prefix.
• setDeployMode(String mode): Sets the deploy mode.
• setJavaHome(String javaHome): Sets a custom JAVA_HOME.
• setMainClass(String mainClass): Sets the main class.
• setMaster(String master): Sets the master URL.
• setPropertiesFile(String path): Sets the internal propertiesFile (see loadPropertiesFile of AbstractCommandBuilder).
• setSparkHome(String sparkHome): Sets a custom SPARK_HOME.
• setVerbose(boolean verbose): Enables verbose reporting for SparkSubmit.

After the invocation of a Spark application is set up, use the launch() method to launch a sub-process that starts the configured Spark application. It is, however, recommended to use the startApplication method instead.

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/SparkLauncher/#source-scala","title":"[source, scala]

import org.apache.spark.launcher.SparkLauncher

val command = new SparkLauncher()
  .setAppResource(\"SparkPi\")
  .setVerbose(true)

val appHandle = command.startApplication()

","text":""},{"location":"tools/pyspark/","title":"pyspark Shell Script","text":"

pyspark shell script runs spark-submit with the pyspark-shell-main application resource as the first argument, followed by the --name \"PySparkShell\" option (and other command-line arguments, if specified).

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/pyspark/#pyspark-shell","title":"pyspark/shell.py","text":"

                                                                                                                                                                                                                                                                                                                                                                                                pyspark/shell.py

                                                                                                                                                                                                                                                                                                                                                                                                Learn more about pyspark/shell.py in The Internals of PySpark.

                                                                                                                                                                                                                                                                                                                                                                                                pyspark/shell.py module is launched as a PYTHONSTARTUP script.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/pyspark/#environment-variables","title":"Environment Variables","text":"

                                                                                                                                                                                                                                                                                                                                                                                                pyspark script exports the following environment variables:

                                                                                                                                                                                                                                                                                                                                                                                                • OLD_PYTHONSTARTUP
                                                                                                                                                                                                                                                                                                                                                                                                • PYSPARK_DRIVER_PYTHON
                                                                                                                                                                                                                                                                                                                                                                                                • PYSPARK_DRIVER_PYTHON_OPTS
                                                                                                                                                                                                                                                                                                                                                                                                • PYSPARK_PYTHON
                                                                                                                                                                                                                                                                                                                                                                                                • PYTHONPATH
                                                                                                                                                                                                                                                                                                                                                                                                • PYTHONSTARTUP
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/pyspark/#OLD_PYTHONSTARTUP","title":"OLD_PYTHONSTARTUP","text":"

                                                                                                                                                                                                                                                                                                                                                                                                pyspark defines OLD_PYTHONSTARTUP environment variable to be the initial value of PYTHONSTARTUP (before it gets redefined).

                                                                                                                                                                                                                                                                                                                                                                                                The idea of OLD_PYTHONSTARTUP is to delay execution of the Python startup script until pyspark/shell.py finishes.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/pyspark/#PYSPARK_PYTHON","title":"PYSPARK_PYTHON","text":"

                                                                                                                                                                                                                                                                                                                                                                                                PYSPARK_PYTHON environment variable can be used to specify a Python executable to run PySpark scripts.

                                                                                                                                                                                                                                                                                                                                                                                                The Internals of PySpark

                                                                                                                                                                                                                                                                                                                                                                                                Learn more about PySpark in The Internals of PySpark.

PYSPARK_PYTHON can be overridden by PYSPARK_DRIVER_PYTHON and configuration properties when SparkSubmitCommandBuilder is requested to buildPySparkShellCommand.

PYSPARK_PYTHON is overridden by the spark.pyspark.python configuration property, if defined, when SparkSubmitCommandBuilder is requested to buildPySparkShellCommand.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/pyspark/#PYTHONSTARTUP","title":"PYTHONSTARTUP","text":"

                                                                                                                                                                                                                                                                                                                                                                                                From Python Documentation:

                                                                                                                                                                                                                                                                                                                                                                                                PYTHONSTARTUP

                                                                                                                                                                                                                                                                                                                                                                                                If this is the name of a readable file, the Python commands in that file are executed before the first prompt is displayed in interactive mode. The file is executed in the same namespace where interactive commands are executed so that objects defined or imported in it can be used without qualification in the interactive session. You can also change the prompts sys.ps1 and sys.ps2 and the hook sys.__interactivehook__ in this file.

                                                                                                                                                                                                                                                                                                                                                                                                pyspark (re)defines PYTHONSTARTUP environment variable to be pyspark/shell.py module:

                                                                                                                                                                                                                                                                                                                                                                                                ${SPARK_HOME}/python/pyspark/shell.py\n

                                                                                                                                                                                                                                                                                                                                                                                                OLD_PYTHONSTARTUP

                                                                                                                                                                                                                                                                                                                                                                                                The initial value of PYTHONSTARTUP environment variable is available as OLD_PYTHONSTARTUP.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-class/","title":"spark-class shell script","text":"

spark-class shell script is the Spark application command-line launcher that is responsible for setting up the JVM environment and executing a Spark application.

NOTE: Ultimately, any shell script in Spark, e.g. spark-submit, calls the spark-class script.

You can find the spark-class script in the bin directory of the Spark distribution.

When started, spark-class first loads $SPARK_HOME/bin/load-spark-env.sh, collects the Spark assembly jars, and executes the org.apache.spark.launcher.Main standalone application.

Depending on whether the RELEASE file exists (i.e. whether it is a packaged Spark distribution or a local build), spark-class sets the SPARK_JARS_DIR environment variable to [SPARK_HOME]/jars or [SPARK_HOME]/assembly/target/scala-[SPARK_SCALA_VERSION]/jars, respectively (the latter being a local build).

If SPARK_JARS_DIR does not exist, spark-class prints the following error message and exits with code 1.

                                                                                                                                                                                                                                                                                                                                                                                                Failed to find Spark jars directory ([SPARK_JARS_DIR]).\nYou need to build Spark with the target \"package\" before running this program.\n

                                                                                                                                                                                                                                                                                                                                                                                                spark-class sets LAUNCH_CLASSPATH environment variable to include all the jars under SPARK_JARS_DIR.

                                                                                                                                                                                                                                                                                                                                                                                                If SPARK_PREPEND_CLASSES is enabled, [SPARK_HOME]/launcher/target/scala-[SPARK_SCALA_VERSION]/classes directory is added to LAUNCH_CLASSPATH as the first entry.

NOTE: Use SPARK_PREPEND_CLASSES to have the Spark launcher classes (from [SPARK_HOME]/launcher/target/scala-[SPARK_SCALA_VERSION]/classes) appear before the other Spark assembly jars. It is useful for development so your changes don't require rebuilding Spark.
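
The classpath setup above can be summarized with the following simplified bash sketch. It is based only on the description in this section (not a verbatim copy of the actual spark-class script, which also handles testing modes and other details):

# simplified sketch of spark-class's classpath setup (not the actual script)\nif [ -f \"${SPARK_HOME}/RELEASE\" ]; then\n  SPARK_JARS_DIR=\"${SPARK_HOME}/jars\"\nelse\n  SPARK_JARS_DIR=\"${SPARK_HOME}/assembly/target/scala-${SPARK_SCALA_VERSION}/jars\"\nfi\n\nif [ ! -d \"$SPARK_JARS_DIR\" ]; then\n  echo \"Failed to find Spark jars directory ($SPARK_JARS_DIR).\" 1>&2\n  exit 1\nfi\n\nLAUNCH_CLASSPATH=\"$SPARK_JARS_DIR/*\"\nif [ -n \"$SPARK_PREPEND_CLASSES\" ]; then\n  # launcher classes go first so local changes take precedence\n  LAUNCH_CLASSPATH=\"${SPARK_HOME}/launcher/target/scala-${SPARK_SCALA_VERSION}/classes:$LAUNCH_CLASSPATH\"\nfi\n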

SPARK_TESTING and SPARK_SQL_TESTING environment variables enable a special test mode.

                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME What's so special about the env vars?

spark-class uses the org.apache.spark.launcher.Main command-line application (described below) to compute the Spark command to launch. The Main class programmatically computes the command that spark-class then executes.

                                                                                                                                                                                                                                                                                                                                                                                                TIP: Use JAVA_HOME to point at the JVM to use.

                                                                                                                                                                                                                                                                                                                                                                                                === [[main]] Launching org.apache.spark.launcher.Main Standalone Application

org.apache.spark.launcher.Main is a standalone application (in the launcher module) that spark-class uses to prepare the Spark command to execute.

Main expects the first parameter to be a class name that determines the \"operation mode\":

                                                                                                                                                                                                                                                                                                                                                                                                1. org.apache.spark.deploy.SparkSubmit -- Main uses link:spark-submit-SparkSubmitCommandBuilder.adoc[SparkSubmitCommandBuilder] to parse command-line arguments. This is the mode link:spark-submit.adoc[spark-submit] uses.
2. anything else -- Main uses SparkClassCommandBuilder to parse command-line arguments.
                                                                                                                                                                                                                                                                                                                                                                                                $ ./bin/spark-class org.apache.spark.launcher.Main\nException in thread \"main\" java.lang.IllegalArgumentException: Not enough arguments: missing class name.\n    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)\n    at org.apache.spark.launcher.Main.main(Main.java:51)\n

Main then calls buildCommand on the builder to build the Spark command.

                                                                                                                                                                                                                                                                                                                                                                                                If SPARK_PRINT_LAUNCH_COMMAND environment variable is enabled, Main prints the final Spark command to standard error.

                                                                                                                                                                                                                                                                                                                                                                                                Spark Command: [cmd]\n========================================\n
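
For example (a hypothetical run; the actual command printed depends on your environment and is abbreviated here with ...):

$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-class org.apache.spark.deploy.SparkSubmit --version\nSpark Command: /path/to/java -cp ... org.apache.spark.deploy.SparkSubmit --version\n========================================\n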

On Windows, Main calls prepareWindowsCommand, while on non-Windows operating systems it calls prepareBashCommand with tokens separated by the null character (\\0).

                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME What's prepareWindowsCommand? prepareBashCommand?

Main uses the following environment variables (illustrated after the list):

• SPARK_DAEMON_JAVA_OPTS and SPARK_MASTER_OPTS are added to the command line of the command to execute.
                                                                                                                                                                                                                                                                                                                                                                                                • SPARK_DAEMON_MEMORY (default: 1g) for -Xms and -Xmx.
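
As an illustration (assuming a Spark Standalone master started with sbin/start-master.sh, which eventually goes through spark-class; the values below are examples only):

# example only: give the master daemon 2g of heap and an extra JVM option\n$ SPARK_DAEMON_MEMORY=2g SPARK_DAEMON_JAVA_OPTS=\"-XX:+UseG1GC\" ./sbin/start-master.sh\n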
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-shell/","title":"spark-shell shell script","text":"

                                                                                                                                                                                                                                                                                                                                                                                                Spark shell is an interactive environment where you can learn how to make the most out of Apache Spark quickly and conveniently.

                                                                                                                                                                                                                                                                                                                                                                                                TIP: Spark shell is particularly helpful for fast interactive prototyping.

Under the covers, Spark shell is a standalone Spark application written in Scala that offers an environment with auto-completion (using the TAB key) where you can run ad-hoc queries and get familiar with the features of Spark (which help you develop your own standalone Spark applications). It is a very convenient tool to explore the many things available in Spark with immediate feedback. It is one of the many reasons why spark-overview.md#why-spark[Spark is so helpful for tasks to process datasets of any size].

                                                                                                                                                                                                                                                                                                                                                                                                There are variants of Spark shell for different languages: spark-shell for Scala, pyspark for Python and sparkR for R.

                                                                                                                                                                                                                                                                                                                                                                                                NOTE: This document (and the book in general) uses spark-shell for Scala only.

You can start Spark shell using the spark-shell script.

                                                                                                                                                                                                                                                                                                                                                                                                $ ./bin/spark-shell\nscala>\n

spark-shell is an extension of the Scala REPL with automatic instantiation of spark-sql-SparkSession.md[SparkSession] as spark (and SparkContext.md[] as sc).

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-shell/#source-scala","title":"[source, scala]","text":"

scala> :type spark\norg.apache.spark.sql.SparkSession\n

// Learn the current version of Spark in use\nscala> spark.version\nres0: String = 2.1.0-SNAPSHOT\n

                                                                                                                                                                                                                                                                                                                                                                                                spark-shell also imports spark-sql-SparkSession.md#implicits[Scala SQL's implicits] and spark-sql-SparkSession.md#sql[sql method].

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-shell/#source-scala_1","title":"[source, scala]","text":"

scala> :imports\n 1) import spark.implicits._       (59 terms, 38 are implicit)\n 2) import spark.sql               (1 terms)\n

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-shell/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                When you execute spark-shell you actually execute spark-submit/index.md[Spark submit] as follows:

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-shell/#optionswrap","title":"[options=\"wrap\"]","text":""},{"location":"tools/spark-shell/#orgapachesparkdeploysparksubmit-class-orgapachesparkreplmain-name-spark-shell-spark-shell","title":"org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell","text":""},{"location":"tools/spark-shell/#set-spark_print_launch_command-to-see-the-entire-command-to-be-executed-refer-to-spark-tips-and-tricksmdspark_print_launch_commandprint-launch-command-of-spark-scripts","title":"Set SPARK_PRINT_LAUNCH_COMMAND to see the entire command to be executed. Refer to spark-tips-and-tricks.md#SPARK_PRINT_LAUNCH_COMMAND[Print Launch Command of Spark Scripts].","text":"

                                                                                                                                                                                                                                                                                                                                                                                                === [[using-spark-shell]] Using Spark shell

                                                                                                                                                                                                                                                                                                                                                                                                You start Spark shell using spark-shell script (available in bin directory).

                                                                                                                                                                                                                                                                                                                                                                                                $ ./bin/spark-shell\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\nWARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nWARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException\nSpark context Web UI available at http://10.47.71.138:4040\nSpark context available as 'sc' (master = local[*], app id = local-1477858597347).\nSpark session available as 'spark'.\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.1.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala>\n

Spark shell creates an instance of spark-sql-SparkSession.md[SparkSession] under the name spark for you (so you don't have to know the details of how to do it yourself on day 1).

                                                                                                                                                                                                                                                                                                                                                                                                scala> :type spark\norg.apache.spark.sql.SparkSession\n

Besides, there is also the sc value, which is an instance of SparkContext.md[].

                                                                                                                                                                                                                                                                                                                                                                                                scala> :type sc\norg.apache.spark.SparkContext\n

To close Spark shell, press Ctrl+D or type :q (or any prefix of :quit).

                                                                                                                                                                                                                                                                                                                                                                                                scala> :q\n
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/","title":"spark-submit Shell Script","text":"

spark-submit shell script is used to submit Spark applications for execution (and also to kill submissions and request their status).

                                                                                                                                                                                                                                                                                                                                                                                                spark-submit is a command-line frontend to SparkSubmit.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#options","title":"Command-Line Options","text":""},{"location":"tools/spark-submit/#archives","title":"archives","text":"
                                                                                                                                                                                                                                                                                                                                                                                                • Command-Line Option: --archives
                                                                                                                                                                                                                                                                                                                                                                                                • Internal Property: archives
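
A hypothetical example on YARN (the archive name, the alias after #, the application class and the jar are placeholders only):

$ ./bin/spark-submit --master yarn --archives myenv.tar.gz#environment --class com.example.MyApp my-app.jar\n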
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#deploy-mode","title":"deploy-mode","text":"

Deploy mode of the driver: client (default) or cluster (see the example after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                • Command-Line Option: --deploy-mode
                                                                                                                                                                                                                                                                                                                                                                                                • Spark Property: spark.submit.deployMode
                                                                                                                                                                                                                                                                                                                                                                                                • Environment Variable: DEPLOY_MODE
                                                                                                                                                                                                                                                                                                                                                                                                • Internal Property: deployMode
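
A hypothetical example (the master URL, application class and jar are placeholders) showing the command-line option and the corresponding Spark property:

$ ./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --class com.example.MyApp my-app.jar\n\n# equivalently, using the Spark property\n$ ./bin/spark-submit --master spark://master:7077 --conf spark.submit.deployMode=cluster --class com.example.MyApp my-app.jar\n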
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#driver-class-path","title":"driver-class-path","text":"
                                                                                                                                                                                                                                                                                                                                                                                                --driver-class-path\n

--driver-class-path command-line option sets the extra class path entries (e.g. jars and directories) that should be added to the driver's JVM.

                                                                                                                                                                                                                                                                                                                                                                                                Tip

                                                                                                                                                                                                                                                                                                                                                                                                Use --driver-class-path in client deploy mode (not SparkConf) to ensure that the CLASSPATH is set up with the entries.

In client deploy mode the driver runs in the same JVM as spark-submit itself, so the driver's class path has to be set before that JVM starts (which is why setting it through SparkConf is too late).

                                                                                                                                                                                                                                                                                                                                                                                                Internal Property: driverExtraClassPath

                                                                                                                                                                                                                                                                                                                                                                                                Spark Property: spark.driver.extraClassPath

                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                Command-line options (e.g. --driver-class-path) have higher precedence than their corresponding Spark settings in a Spark properties file (e.g. spark.driver.extraClassPath). You can therefore control the final settings by overriding Spark settings on command line using the command-line options.
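
A hypothetical illustration of that precedence (the paths, class and jar are placeholders only):

# conf/spark-defaults.conf\nspark.driver.extraClassPath /opt/libs/old-client.jar\n\n# the command-line option below takes precedence over the setting above\n$ ./bin/spark-submit --driver-class-path /opt/libs/new-client.jar --class com.example.MyApp my-app.jar\n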

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#driver-cores","title":"driver-cores","text":"
                                                                                                                                                                                                                                                                                                                                                                                                --driver-cores NUM\n

--driver-cores command-line option sets the number of cores for the driver to NUM in cluster deploy mode.

                                                                                                                                                                                                                                                                                                                                                                                                Spark Property: spark.driver.cores

                                                                                                                                                                                                                                                                                                                                                                                                Note

Only available in cluster deploy mode (when the driver is executed outside of spark-submit's JVM).

                                                                                                                                                                                                                                                                                                                                                                                                Internal Property: driverCores
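
For example, in cluster deploy mode on Spark Standalone (the master URL, class and jar are placeholders only):

$ ./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --driver-cores 2 --class com.example.MyApp my-app.jar\n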

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#properties-file","title":"properties-file","text":"
                                                                                                                                                                                                                                                                                                                                                                                                --properties-file [FILE]\n

                                                                                                                                                                                                                                                                                                                                                                                                --properties-file command-line option sets the path to a file FILE from which Spark loads extra Spark properties.

                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                Spark uses conf/spark-defaults.conf by default.
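
As an illustration, a custom properties file (same key-value format as conf/spark-defaults.conf; the file name and settings are examples only) and how to pass it:

# my-spark.conf\nspark.master            local[4]\nspark.executor.memory   2g\n\n$ ./bin/spark-submit --properties-file my-spark.conf --class com.example.MyApp my-app.jar\n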

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#queue","title":"queue","text":"
                                                                                                                                                                                                                                                                                                                                                                                                --queue QUEUE_NAME\n

The YARN resource queue to submit the application to (see the example after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                • Spark Property: spark.yarn.queue
                                                                                                                                                                                                                                                                                                                                                                                                • Internal Property: queue
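
For example, on YARN (the queue name, class and jar are placeholders only; the Spark property spark.yarn.queue could be set instead):

$ ./bin/spark-submit --master yarn --queue dev --class com.example.MyApp my-app.jar\n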
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#version","title":"version","text":"

                                                                                                                                                                                                                                                                                                                                                                                                Command-Line Option: --version

                                                                                                                                                                                                                                                                                                                                                                                                $ ./bin/spark-submit --version\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.1.0-SNAPSHOT\n      /_/\n\nBranch master\nCompiled by user jacek on 2016-09-30T07:08:39Z\nRevision 1fad5596885aab8b32d2307c0edecbae50d5bd7a\nUrl https://github.com/apache/spark.git\nType --help for more information.\n
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/#SPARK_PRINT_LAUNCH_COMMAND","title":"SPARK_PRINT_LAUNCH_COMMAND","text":"

SPARK_PRINT_LAUNCH_COMMAND environment variable makes the Spark scripts print out the complete Spark command to standard output.

                                                                                                                                                                                                                                                                                                                                                                                                $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell\nSpark Command: /Library/Ja...\n
                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/SparkSubmit/","title":"SparkSubmit","text":"

                                                                                                                                                                                                                                                                                                                                                                                                SparkSubmit is the entry point to spark-submit shell script.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/SparkSubmit/#special-primary-resource-names","title":"Special Primary Resource Names

                                                                                                                                                                                                                                                                                                                                                                                                SparkSubmit uses the following special primary resource names to represent Spark shells rather than application jars:

                                                                                                                                                                                                                                                                                                                                                                                                • spark-shell
                                                                                                                                                                                                                                                                                                                                                                                                • pyspark-shell
                                                                                                                                                                                                                                                                                                                                                                                                • sparkr-shell
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#pyspark-shell","title":"pyspark-shell

                                                                                                                                                                                                                                                                                                                                                                                                SparkSubmit uses pyspark-shell when:

                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to prepareSubmitEnvironment for .py scripts or pyspark, isShell and isPython
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isshell","title":"isShell
                                                                                                                                                                                                                                                                                                                                                                                                isShell(\n  res: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                isShell is true when the given res primary resource represents a Spark shell.

                                                                                                                                                                                                                                                                                                                                                                                                isShell\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to prepareSubmitEnvironment and isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmitArguments is requested to handleUnknown (and determine a primary application resource)
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#actions","title":"Actions

                                                                                                                                                                                                                                                                                                                                                                                                SparkSubmit executes actions (based on the action argument).

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#killing-submission","title":"Killing Submission
                                                                                                                                                                                                                                                                                                                                                                                                kill(\n  args: SparkSubmitArguments): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                kill...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#displaying-version","title":"Displaying Version
                                                                                                                                                                                                                                                                                                                                                                                                printVersion(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                printVersion...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#submission-status","title":"Submission Status
                                                                                                                                                                                                                                                                                                                                                                                                requestStatus(\n  args: SparkSubmitArguments): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                requestStatus...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#submit","title":"Application Submission
                                                                                                                                                                                                                                                                                                                                                                                                submit(\n  args: SparkSubmitArguments,\n  uninitLog: Boolean): Unit\n

submit doRunMain unless isStandaloneCluster with useRest enabled.

                                                                                                                                                                                                                                                                                                                                                                                                For isStandaloneCluster with useRest requested, submit...FIXME
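
The branching above can be sketched as follows (a minimal sketch with a stand-in for SparkSubmitArguments; the REST-gateway path, the FIXME above, is reduced to a comment):

// Stand-in for SparkSubmitArguments with only the two flags used here (an assumption)
case class Args(isStandaloneCluster: Boolean, useRest: Boolean)

def doRunMain(): Unit = println("running the main class")

def submit(args: Args): Unit =
  if (args.isStandaloneCluster && args.useRest) {
    // REST-based submission gateway of standalone cluster mode (details left as FIXME above)
    println("submitting over the REST submission gateway")
  } else {
    doRunMain()
  }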

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#doRunMain","title":"doRunMain","text":"
                                                                                                                                                                                                                                                                                                                                                                                                doRunMain(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                doRunMain runMain unless proxyUser is specified.

                                                                                                                                                                                                                                                                                                                                                                                                With proxyUser specified, doRunMain...FIXME
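
With a proxy user, the usual technique is Hadoop's UserGroupInformation.doAs; the following is a sketch only (it assumes hadoop-common on the classpath and leaves out exception handling):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runMain(): Unit = println("running the main class")

def doRunMain(proxyUser: Option[String]): Unit = proxyUser match {
  case Some(user) =>
    // Impersonate the proxy user for the duration of runMain
    val ugi = UserGroupInformation.createProxyUser(user, UserGroupInformation.getCurrentUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = runMain()
    })
  case None =>
    runMain()
}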

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/SparkSubmit/#runMain","title":"Running Main Class","text":"
                                                                                                                                                                                                                                                                                                                                                                                                runMain(\n  args: SparkSubmitArguments,\n  uninitLog: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                runMain prepares submit environment for the given SparkSubmitArguments (that gives childArgs, childClasspath, sparkConf and childMainClass).

                                                                                                                                                                                                                                                                                                                                                                                                With verbose enabled, runMain prints out the following INFO messages to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                Main class:\n[childMainClass]\nArguments:\n[childArgs]\nSpark config:\n[sparkConf_redacted]\nClasspath elements:\n[childClasspath]\n

                                                                                                                                                                                                                                                                                                                                                                                                runMain creates and sets a context classloader (based on spark.driver.userClassPathFirst configuration property) and adds the jars (from childClasspath).

                                                                                                                                                                                                                                                                                                                                                                                                runMain loads the main class (childMainClass).

runMain creates a SparkApplication (if the main class is a subtype of SparkApplication) or a JavaMainApplication (that wraps the main class) otherwise.

                                                                                                                                                                                                                                                                                                                                                                                                In the end, runMain requests the SparkApplication to start (with the childArgs and sparkConf).
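
The whole sequence can be condensed into a sketch (stand-in types only: AppLike stands in for SparkApplication, a plain Map for SparkConf, and the classloader setup is reduced to a comment):

// Stand-in for SparkApplication
trait AppLike {
  def start(args: Array[String], conf: Map[String, String]): Unit
}

def runMainSketch(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sparkConf: Map[String, String],
    childMainClass: String): Unit = {
  // 1. Build a context classloader (userClassPathFirst or not) and add the childClasspath jars -- omitted
  // 2. Load the main class
  val mainClass = Class.forName(childMainClass)
  // 3. Use it directly if it is an AppLike, otherwise wrap its static main(Array[String]) method
  val app: AppLike =
    if (classOf[AppLike].isAssignableFrom(mainClass))
      mainClass.getConstructor().newInstance().asInstanceOf[AppLike]
    else
      new AppLike {
        def start(args: Array[String], conf: Map[String, String]): Unit =
          mainClass.getMethod("main", classOf[Array[String]]).invoke(null, args)
      }
  // 4. Start the application with the child arguments and the configuration
  app.start(childArgs.toArray, sparkConf)
}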

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/SparkSubmit/#cluster-managers","title":"Cluster Managers

SparkSubmit has built-in support for some cluster managers (selected based on the master argument).

Nickname       Master URL
KUBERNETES     k8s:// prefix
LOCAL          local prefix
MESOS          mesos prefix
STANDALONE     spark prefix
YARN           yarn
","text":""},{"location":"tools/spark-submit/SparkSubmit/#main","title":"Launching Standalone Application
                                                                                                                                                                                                                                                                                                                                                                                                main(\n  args: Array[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                main creates a SparkSubmit to doSubmit (with the given args).

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#doSubmit","title":"doSubmit
                                                                                                                                                                                                                                                                                                                                                                                                doSubmit(\n  args: Array[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                doSubmit initializeLogIfNecessary.

                                                                                                                                                                                                                                                                                                                                                                                                doSubmit parses the arguments in the given args (that gives a SparkSubmitArguments).

                                                                                                                                                                                                                                                                                                                                                                                                With verbose option on, doSubmit prints out the appArgs to standard output.

                                                                                                                                                                                                                                                                                                                                                                                                doSubmit branches off based on action.

Action            Handler
SUBMIT            submit
KILL              kill
REQUEST_STATUS    requestStatus
PRINT_VERSION     printVersion
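
The dispatch is a straightforward pattern match on the action; a sketch with a sealed trait standing in for Spark's SparkSubmitAction enumeration:

sealed trait Action
case object Submit        extends Action
case object Kill          extends Action
case object RequestStatus extends Action
case object PrintVersion  extends Action

// Route the parsed action to its handler (handlers reduced to println placeholders)
def dispatch(action: Action): Unit = action match {
  case Submit        => println("submit(args, uninitLog)")
  case Kill          => println("kill(args)")
  case RequestStatus => println("requestStatus(args)")
  case PrintVersion  => println("printVersion()")
}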

                                                                                                                                                                                                                                                                                                                                                                                                doSubmit is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • InProcessSparkSubmit standalone application is started
                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit standalone application is started
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#parseArguments","title":"Parsing Arguments
                                                                                                                                                                                                                                                                                                                                                                                                parseArguments(\n  args: Array[String]): SparkSubmitArguments\n

                                                                                                                                                                                                                                                                                                                                                                                                parseArguments creates a SparkSubmitArguments (with the given args).

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#prepareSubmitEnvironment","title":"prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment(\n  args: SparkSubmitArguments,\n  conf: Option[HadoopConfiguration] = None): (Seq[String], Seq[String], SparkConf, String)\n

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment creates a 4-element tuple made up of the following:

                                                                                                                                                                                                                                                                                                                                                                                                1. childArgs for arguments
                                                                                                                                                                                                                                                                                                                                                                                                2. childClasspath for Classpath elements
3. sparkConf for Spark properties (SparkConf)
                                                                                                                                                                                                                                                                                                                                                                                                4. childMainClass

                                                                                                                                                                                                                                                                                                                                                                                                Tip

                                                                                                                                                                                                                                                                                                                                                                                                Use --verbose command-line option to have the elements of the tuple printed out to the standard output.
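
Callers (runMain, for one) destructure the tuple into its four parts. A sketch with the tuple built inline and a plain Map standing in for SparkConf; the values are made up for illustration:

// Hypothetical result standing in for what prepareSubmitEnvironment would return
val prepared: (Seq[String], Seq[String], Map[String, String], String) =
  (Seq("--arg1"), Seq("app.jar"), Map("spark.master" -> "local[*]"), "org.example.Main")

// Destructure into the four elements listed above
val (childArgs, childClasspath, sparkConf, childMainClass) = prepared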

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                For isPython in CLIENT deploy mode, prepareSubmitEnvironment sets the following based on primaryResource:

                                                                                                                                                                                                                                                                                                                                                                                                • For pyspark-shell the mainClass is org.apache.spark.api.python.PythonGatewayServer

• Otherwise, the mainClass is org.apache.spark.deploy.PythonRunner, with the main Python file, the extra Python files and the childArgs as arguments (see the sketch below)
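
A sketch of that selection (class names as plain strings; the exact ordering of the PythonRunner arguments is simplified):

// Choose the main class and the child arguments for a PySpark app in client deploy mode
def pysparkClientMainClass(
    primaryResource: String,
    pyFiles: String,
    childArgs: Seq[String]): (String, Seq[String]) =
  if (primaryResource == "pyspark-shell")
    ("org.apache.spark.api.python.PythonGatewayServer", childArgs)
  else
    // main Python file, extra Python files, then the original child arguments
    ("org.apache.spark.deploy.PythonRunner", Seq(primaryResource, pyFiles) ++ childArgs)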

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment determines the cluster manager based on the master argument.

                                                                                                                                                                                                                                                                                                                                                                                                For KUBERNETES, prepareSubmitEnvironment checkAndGetK8sMasterUrl.

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#childMainClass","title":"childMainClass

childMainClass is the fourth (and last) element of the result tuple of prepareSubmitEnvironment.

                                                                                                                                                                                                                                                                                                                                                                                                // (childArgs, childClasspath, sparkConf, childMainClass)\n(Seq[String], Seq[String], SparkConf, String)\n

                                                                                                                                                                                                                                                                                                                                                                                                childMainClass can be as follows (based on the deployMode):

Deploy Mode    Master URL     childMainClass
client         any            mainClass
cluster        KUBERNETES     KubernetesClientApplication
cluster        MESOS          RestSubmissionClientApp (for REST submission API)
cluster        STANDALONE     RestSubmissionClientApp (for REST submission API)
cluster        STANDALONE     ClientApp
cluster        YARN           YarnClusterApplication
","text":""},{"location":"tools/spark-submit/SparkSubmit/#iskubernetesclient","title":"isKubernetesClient

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment uses isKubernetesClient flag to indicate that:

                                                                                                                                                                                                                                                                                                                                                                                                • Cluster manager is Kubernetes
                                                                                                                                                                                                                                                                                                                                                                                                • Deploy mode is client
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#iskubernetesclustermodedriver","title":"isKubernetesClusterModeDriver

                                                                                                                                                                                                                                                                                                                                                                                                prepareSubmitEnvironment uses isKubernetesClusterModeDriver flag to indicate that:

                                                                                                                                                                                                                                                                                                                                                                                                • isKubernetesClient
                                                                                                                                                                                                                                                                                                                                                                                                • spark.kubernetes.submitInDriver configuration property is enabled (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#renameresourcestolocalfs","title":"renameResourcesToLocalFS
                                                                                                                                                                                                                                                                                                                                                                                                renameResourcesToLocalFS(\n  resources: String,\n  localResources: String): String\n

                                                                                                                                                                                                                                                                                                                                                                                                renameResourcesToLocalFS...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                renameResourcesToLocalFS is used for isKubernetesClusterModeDriver mode.

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#downloadresource","title":"downloadResource
                                                                                                                                                                                                                                                                                                                                                                                                downloadResource(\n  resource: String): String\n

                                                                                                                                                                                                                                                                                                                                                                                                downloadResource...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#checking-whether-resource-is-internal","title":"Checking Whether Resource is Internal
                                                                                                                                                                                                                                                                                                                                                                                                isInternal(\n  res: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                isInternal is true when the given res is spark-internal.
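
That makes isInternal a one-liner; a sketch with the spark-internal marker inlined as a string literal:

// "spark-internal" marks a submission without a user-provided primary resource
def isInternal(res: String): Boolean = res == "spark-internal"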

                                                                                                                                                                                                                                                                                                                                                                                                isInternal is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmitArguments is requested to handleUnknown
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isuserjar","title":"isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                isUserJar(\n  res: String): Boolean\n

isUserJar is true when the given res is none of the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                • isShell
                                                                                                                                                                                                                                                                                                                                                                                                • isPython
                                                                                                                                                                                                                                                                                                                                                                                                • isInternal
                                                                                                                                                                                                                                                                                                                                                                                                • isR
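
A sketch of the negation, with simplified stand-ins for the sibling predicates so the snippet is self-contained (assumptions, not the exact checks in SparkSubmit):

// Simplified stand-ins for the sibling predicates (assumptions)
def isShell(res: String): Boolean    = res == "spark-shell" || res == "pyspark-shell" || res == "sparkr-shell"
def isPython(res: String): Boolean   = res != null && (res.endsWith(".py") || res == "pyspark-shell")
def isR(res: String): Boolean        = res != null && (res.endsWith(".R") || res == "sparkr-shell")
def isInternal(res: String): Boolean = res == "spark-internal"

// A user jar is any primary resource that is none of the above
def isUserJar(res: String): Boolean =
  !isShell(res) && !isPython(res) && !isInternal(res) && !isR(res)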

                                                                                                                                                                                                                                                                                                                                                                                                isUserJar is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isPython","title":"isPython
                                                                                                                                                                                                                                                                                                                                                                                                isPython(\n  res: String): Boolean\n

isPython is positive (true) when the given res (a primary resource) represents a PySpark application (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                • .py script
                                                                                                                                                                                                                                                                                                                                                                                                • pyspark-shell
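
Expressed directly from the two cases above (a sketch with the pyspark-shell marker inlined):

// A PySpark application: a .py script or the pyspark-shell primary resource
def isPython(res: String): Boolean =
  res != null && (res.endsWith(".py") || res == "pyspark-shell")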

                                                                                                                                                                                                                                                                                                                                                                                                isPython is used when:

                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmit is requested to isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                • SparkSubmitArguments is requested to handle an unknown option
                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/","title":"SparkSubmitArguments","text":"

SparkSubmitArguments is created for SparkSubmit to parseArguments.

SparkSubmitArguments is a custom SparkSubmitArgumentsParser that handles the command-line arguments of the spark-submit script which the actions use for execution (possibly with an explicit env environment).

SparkSubmitArguments is created when the spark-submit script is launched (with only args passed in) and is later used for printing out the arguments in verbose mode.

                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"tools/spark-submit/SparkSubmitArguments/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                SparkSubmitArguments takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                • Arguments (Seq[String])
                                                                                                                                                                                                                                                                                                                                                                                                • Environment Variables (default: sys.env)

SparkSubmitArguments is created when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmit is requested to parseArguments
                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitArguments/#action","title":"Action","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  action: SparkSubmitAction\n

                                                                                                                                                                                                                                                                                                                                                                                                  action is used by SparkSubmit to determine what to do when executed.

                                                                                                                                                                                                                                                                                                                                                                                                  action can be one of the following SparkSubmitActions:

Action            Description
SUBMIT            The default action if none specified
KILL              Indicates --kill switch
REQUEST_STATUS    Indicates --status switch
PRINT_VERSION     Indicates --version switch

action is undefined (null) by default (when SparkSubmitArguments is created).

action is validated in validateArguments.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitArguments/#command-line-options","title":"Command-Line Options","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#-files","title":"--files
                                                                                                                                                                                                                                                                                                                                                                                                  • Configuration Property: spark.files
                                                                                                                                                                                                                                                                                                                                                                                                  • Configuration Property (Spark on YARN): spark.yarn.dist.files

                                                                                                                                                                                                                                                                                                                                                                                                  Printed out to standard output for --verbose option

                                                                                                                                                                                                                                                                                                                                                                                                  When SparkSubmit is requested to prepareSubmitEnvironment, the files are:

                                                                                                                                                                                                                                                                                                                                                                                                  • resolveGlobPaths
                                                                                                                                                                                                                                                                                                                                                                                                  • downloadFileList
                                                                                                                                                                                                                                                                                                                                                                                                  • renameResourcesToLocalFS
                                                                                                                                                                                                                                                                                                                                                                                                  • downloadResource
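As an aside, the sketch below sets the equivalent spark.files configuration property programmatically; the application name and file path are made up for the example, and the distributed file is later resolved with SparkFiles.

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Equivalent to: spark-submit --files /tmp/lookup.csv ...
// (the file path and application name are hypothetical)
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("FilesDemo")
  .set("spark.files", "/tmp/lookup.csv") // comma-separated list of files

val sc = SparkContext.getOrCreate(conf)

// On the driver and executors, the distributed file is resolved with SparkFiles
val localPath = SparkFiles.get("lookup.csv")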
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#loading-spark-properties","title":"Loading Spark Properties
                                                                                                                                                                                                                                                                                                                                                                                                  loadEnvironmentArguments(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                  loadEnvironmentArguments loads the Spark properties for the current execution of spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                  loadEnvironmentArguments reads command-line options first followed by Spark properties and System's environment variables.
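That precedence can be pictured with a minimal sketch (not the Spark sources): the first defined value among a command-line option, a Spark property, and an environment variable wins. The values below are illustrative only.

def resolve(
    commandLine: Option[String],
    sparkProperty: Option[String],
    envVariable: Option[String]): Option[String] =
  commandLine.orElse(sparkProperty).orElse(envVariable)

// e.g. resolving the master URL (values are made up)
val master = resolve(
  commandLine = None,                  // no --master on the command line
  sparkProperty = Some("local[*]"),    // spark.master from --conf or spark-defaults.conf
  envVariable = sys.env.get("MASTER")) // MASTER environment variable, if any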

                                                                                                                                                                                                                                                                                                                                                                                                  Note

Spark config properties start with the spark. prefix and can be set using the --conf [key=value] command-line option.

                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#handle","title":"Option Handling SparkSubmitOptionParser
                                                                                                                                                                                                                                                                                                                                                                                                  handle(\n  opt: String,\n  value: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                  handle is part of the SparkSubmitOptionParser abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                  handle parses the input opt argument and assigns the given value to corresponding properties.

In the end, handle returns whether parsing should continue, i.e. true for any action but PRINT_VERSION.
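A minimal, hypothetical sketch of that contract (not the Spark sources): assign the given value to the matching internal property and report whether parsing should continue.

object HandleSketch {
  sealed trait Action
  case object Submit extends Action
  case object Kill extends Action
  case object RequestStatus extends Action
  case object PrintVersion extends Action

  // simplified stand-ins for SparkSubmitArguments' internal properties
  var name: Option[String] = None
  var submissionId: Option[String] = None
  var action: Action = Submit

  def handle(opt: String, value: String): Boolean = {
    opt match {
      case "--name"    => name = Option(value)
      case "--kill"    => submissionId = Option(value); action = Kill
      case "--status"  => submissionId = Option(value); action = RequestStatus
      case "--version" => action = PrintVersion
      case other       => throw new IllegalArgumentException(s"Unexpected argument '$other'.")
    }
    action != PrintVersion // false ends parsing, as for PRINT_VERSION
  }
}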

User Option (opt) and the property it sets:

• --kill: action
• --name: name
• --status: action
• --version: action
• ...: ...
","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#mergedefaultsparkproperties","title":"mergeDefaultSparkProperties
                                                                                                                                                                                                                                                                                                                                                                                                  mergeDefaultSparkProperties(): Unit\n

mergeDefaultSparkProperties merges Spark properties from the default Spark properties file (spark-defaults.conf) with those specified through the --conf command-line option.
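A sketch of the merge semantics (not the Spark sources; --conf values take precedence over the defaults):

// Values as if read from spark-defaults.conf (hypothetical)
val defaultProperties = Map(
  "spark.master" -> "local[*]",
  "spark.eventLog.enabled" -> "true")

// Values as if given with --conf key=value on the command line (hypothetical)
val commandLineProperties = Map("spark.master" -> "yarn")

// Command-line properties win over the defaults
val effectiveProperties = defaultProperties ++ commandLineProperties
// effectiveProperties("spark.master") == "yarn"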

                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#isPython","title":"isPython
                                                                                                                                                                                                                                                                                                                                                                                                  isPython: Boolean = false\n

                                                                                                                                                                                                                                                                                                                                                                                                  isPython indicates whether the application resource is a PySpark application (a Python script or pyspark shell).

isPython is set when SparkSubmitArguments is requested to handle an unknown option (i.e., the application resource).

                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#client-deploy-mode","title":"Client Deploy Mode

With the isPython flag enabled, SparkSubmit determines the mainClass (and the childArgs) based on the primaryResource, as sketched below.
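A sketch of that decision (not the Spark sources); the mapping is summarized right after:

// Picks the main class of a PySpark application in client deploy mode.
def pySparkMainClass(primaryResource: String): String =
  if (primaryResource == "pyspark-shell")
    "org.apache.spark.api.python.PythonGatewayServer" // the PySpark shell
  else
    "org.apache.spark.deploy.PythonRunner"            // any other Python application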

primaryResource and the corresponding mainClass:

• pyspark-shell: org.apache.spark.api.python.PythonGatewayServer (PySpark)
• anything else: org.apache.spark.deploy.PythonRunner (PySpark)
","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder.OptionParser/","title":"SparkSubmitCommandBuilder.OptionParser","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitCommandBuilder.OptionParser is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/","title":"SparkSubmitCommandBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitCommandBuilder is an AbstractCommandBuilder.

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitCommandBuilder is used to build a command that spark-submit and SparkLauncher use to launch a Spark application.
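For example, SparkLauncher can launch that command programmatically. The sketch below assumes a Spark home, application jar and main class that are made up for the illustration.

import org.apache.spark.launcher.SparkLauncher

// launch() spawns bin/spark-submit with the command assembled by SparkSubmitCommandBuilder
// (the paths and class name below are hypothetical)
val sparkSubmitProcess = new SparkLauncher()
  .setSparkHome("/opt/spark")
  .setAppResource("/apps/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("local[*]")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()

sparkSubmitProcess.waitFor()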

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitCommandBuilder uses the first argument to distinguish the shells:

                                                                                                                                                                                                                                                                                                                                                                                                  1. pyspark-shell-main
                                                                                                                                                                                                                                                                                                                                                                                                  2. sparkr-shell-main
                                                                                                                                                                                                                                                                                                                                                                                                  3. run-example

SparkSubmitCommandBuilder parses command-line arguments using OptionParser (which is a SparkSubmitOptionParser). OptionParser comes with the following methods:

                                                                                                                                                                                                                                                                                                                                                                                                  1. handle to handle the known options (see the table below). It sets up master, deployMode, propertiesFile, conf, mainClass, sparkArgs internal properties.

2. handleUnknown to handle unrecognized options that usually lead to an Unrecognized option error message.

                                                                                                                                                                                                                                                                                                                                                                                                  3. handleExtraArgs to handle extra arguments that are considered a Spark application's arguments.

                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                  For spark-shell it assumes that the application arguments are after spark-submit's arguments.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#pyspark-shell-main","title":"pyspark-shell-main Application Resource

When the bin/pyspark shell script (and bin\\pyspark2.cmd) is launched, it uses bin/spark-submit with the pyspark-shell-main application resource as the first argument (followed by the --name \"PySparkShell\" option, among others).

                                                                                                                                                                                                                                                                                                                                                                                                  pyspark-shell-main is used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmitCommandBuilder is created and then requested to build a command (buildPySparkShellCommand actually)
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildCommand","title":"Building Command AbstractCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                  List<String> buildCommand(\n  Map<String, String> env)\n

                                                                                                                                                                                                                                                                                                                                                                                                  buildCommand is part of the AbstractCommandBuilder abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                  buildCommand branches off based on the application resource.

Application Resource and the corresponding command builder:

• pyspark-shell-main (but not isSpecialCommand): buildPySparkShellCommand
• sparkr-shell-main (but not isSpecialCommand): buildSparkRCommand
• anything else: buildSparkSubmitCommand
","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildPySparkShellCommand","title":"buildPySparkShellCommand","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  List<String> buildPySparkShellCommand(\n  Map<String, String> env)\n
Note: appArgs is expected to be empty.

buildPySparkShellCommand makes sure that either:

• There are no appArgs, or
• The first argument is not a Python script (a file with the .py extension)

                                                                                                                                                                                                                                                                                                                                                                                                  buildPySparkShellCommand sets the application resource as pyspark-shell.

Note

buildPySparkShellCommand is executed for the pyspark-shell-main application resource, which is re-defined (reset) to pyspark-shell at this point.

buildPySparkShellCommand constructEnvVarArgs with the given env and the PYSPARK_SUBMIT_ARGS environment variable.

                                                                                                                                                                                                                                                                                                                                                                                                  buildPySparkShellCommand defines an internal pyargs collection for the parts of the shell command to execute.

buildPySparkShellCommand stores the Python executable (in pyargs) to be the first specified in the following order (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                  • spark.pyspark.driver.python configuration property
                                                                                                                                                                                                                                                                                                                                                                                                  • spark.pyspark.python configuration property
                                                                                                                                                                                                                                                                                                                                                                                                  • PYSPARK_DRIVER_PYTHON environment variable
                                                                                                                                                                                                                                                                                                                                                                                                  • PYSPARK_PYTHON environment variable
                                                                                                                                                                                                                                                                                                                                                                                                  • python3
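A sketch of that resolution order (not the Spark sources); the configuration map below stands in for the effective Spark config and its value is hypothetical:

// "First defined wins" across configuration properties, environment variables
// and the python3 default.
def firstDefined(candidates: Option[String]*): Option[String] =
  candidates.flatten.find(_.trim.nonEmpty)

val conf = Map("spark.pyspark.python" -> "python3.11") // hypothetical effective config

val pythonExecutable = firstDefined(
  conf.get("spark.pyspark.driver.python"),
  conf.get("spark.pyspark.python"),
  sys.env.get("PYSPARK_DRIVER_PYTHON"),
  sys.env.get("PYSPARK_PYTHON"),
  Some("python3")).get
// pythonExecutable == "python3.11" with the config above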

buildPySparkShellCommand sets the following environment variables (for the Python executable to use), if specified:

• PYSPARK_PYTHON (from the spark.pyspark.python configuration property)
• SPARK_REMOTE (from the remote option or the spark.remote configuration property)

                                                                                                                                                                                                                                                                                                                                                                                                  In the end, buildPySparkShellCommand copies all the options from PYSPARK_DRIVER_PYTHON_OPTS, if specified.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildSparkSubmitCommand","title":"buildSparkSubmitCommand","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  List<String> buildSparkSubmitCommand(\n  Map<String, String> env)\n

buildSparkSubmitCommand starts by building the so-called effective config. When in client deploy mode, buildSparkSubmitCommand adds spark.driver.extraClassPath to the resulting Spark command.

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitCommand builds the first part of the Java command passing in the extra classpath (only for client deploy mode).

                                                                                                                                                                                                                                                                                                                                                                                                  Add isThriftServer case

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitCommand appends SPARK_SUBMIT_OPTS and SPARK_JAVA_OPTS environment variables.

                                                                                                                                                                                                                                                                                                                                                                                                  (only for client deploy mode) ...

Elaborate on the client deploy mode case

                                                                                                                                                                                                                                                                                                                                                                                                  addPermGenSizeOpt case...elaborate

                                                                                                                                                                                                                                                                                                                                                                                                  Elaborate on addPermGenSizeOpt

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitCommand appends org.apache.spark.deploy.SparkSubmit and the command-line arguments (using buildSparkSubmitArgs).
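Put together, the resulting command has roughly the following shape in client deploy mode (a sketch only; every value below is hypothetical and depends on the configuration):

// Illustrative only: not produced by Spark, just the shape of the final command.
val sparkSubmitCommand = Seq(
  "/usr/lib/jvm/java-17/bin/java",                          // the JVM to run
  "-cp", "/opt/spark/conf:/opt/spark/jars/*:/extra/libs/*", // incl. spark.driver.extraClassPath
  "-Xmx2g",                                                 // from spark.driver.memory
  "org.apache.spark.deploy.SparkSubmit",                    // the main class to execute
  "--master", "local[*]",                                   // buildSparkSubmitArgs output follows
  "--class", "com.example.MyApp",
  "/apps/my-app.jar")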

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildsparksubmitargs","title":"buildSparkSubmitArgs
                                                                                                                                                                                                                                                                                                                                                                                                  List<String> buildSparkSubmitArgs()\n

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitArgs builds a list of command-line arguments for spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitArgs uses a SparkSubmitOptionParser to add the command-line arguments that spark-submit recognizes (when it is executed later on and uses the very same SparkSubmitOptionParser parser to parse command-line arguments).
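An illustrative argument list for a builder configured with a master, deploy mode, application name and main class (all values are hypothetical):

// Not produced by Spark: just the kind of list buildSparkSubmitArgs emits.
val sparkSubmitArgs = Seq(
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "--name", "MyApp",
  "--conf", "spark.executor.memory=4g",
  "--class", "com.example.MyApp",
  "/apps/my-app.jar",           // appResource (passed straight through)
  "arg1", "arg2")               // appArgs (passed straight through)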

                                                                                                                                                                                                                                                                                                                                                                                                  buildSparkSubmitArgs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • InProcessLauncher is requested to startApplication
                                                                                                                                                                                                                                                                                                                                                                                                  • SparkLauncher is requested to createBuilder
                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmitCommandBuilder is requested to buildSparkSubmitCommand and constructEnvVarArgs
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#sparksubmitcommandbuilder-properties-and-sparksubmitoptionparser-attributes","title":"SparkSubmitCommandBuilder Properties and SparkSubmitOptionParser Attributes SparkSubmitCommandBuilder Property SparkSubmitOptionParser Attribute verbose VERBOSE master MASTER [master] deployMode DEPLOY_MODE [deployMode] appName NAME [appName] conf CONF [key=value]* propertiesFile PROPERTIES_FILE [propertiesFile] jars JARS [comma-separated jars] files FILES [comma-separated files] pyFiles PY_FILES [comma-separated pyFiles] mainClass CLASS [mainClass] sparkArgs sparkArgs (passed straight through) appResource appResource (passed straight through) appArgs appArgs (passed straight through)","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/","title":"SparkSubmitOperation","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitOperation is an abstraction of operations of spark-submit (when requested to kill a submission or for a submission status).

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitOperation/#contract","title":"Contract","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#killing-submission","title":"Killing Submission
                                                                                                                                                                                                                                                                                                                                                                                                  kill(\n  submissionId: String,\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                  Kills a given submission

                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmit is requested to kill a submission
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#displaying-submission-status","title":"Displaying Submission Status
                                                                                                                                                                                                                                                                                                                                                                                                  printSubmissionStatus(\n  submissionId: String,\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                  Displays status of a given submission

                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmit is requested for submission status
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#checking-whether-master-url-supported","title":"Checking Whether Master URL Supported
                                                                                                                                                                                                                                                                                                                                                                                                  supports(\n  master: String): Boolean\n

Checks whether a given master URL is supported.

Used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmit is requested to kill a submission and for a submission status (via getSubmitOperations utility)
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  • K8SSparkSubmitOperation (Spark on Kubernetes)
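A hypothetical implementation could look as follows. This is a sketch only: SparkSubmitOperation is a Spark-internal trait, so real implementations (like the Kubernetes one above) live under the org.apache.spark namespace.

package org.apache.spark.deploy.demo // assumption: within the org.apache.spark namespace

import org.apache.spark.SparkConf
import org.apache.spark.deploy.SparkSubmitOperation

// A hypothetical operation for a made-up "demo://" cluster manager,
// sketched only to show the three methods of the contract.
class DemoSubmitOperation extends SparkSubmitOperation {

  override def kill(submissionId: String, conf: SparkConf): Unit =
    println(s"Killing submission $submissionId") // would call the cluster manager's API

  override def printSubmissionStatus(submissionId: String, conf: SparkConf): Unit =
    println(s"Status of submission $submissionId: RUNNING") // would query the cluster manager

  override def supports(master: String): Boolean =
    master.startsWith("demo://") // hypothetical master URL scheme
}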
                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitOptionParser/","title":"SparkSubmitOptionParser","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitOptionParser is the parser of spark-submit's command-line options.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#parse","title":"Parsing Arguments","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  void parse(\n  List<String> args)\n

                                                                                                                                                                                                                                                                                                                                                                                                  parse...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                  parse is used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • AbstractLauncher is requested to addSparkArg
                                                                                                                                                                                                                                                                                                                                                                                                  • Main is launched
                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmitCommandBuilder is created and requested to buildSparkSubmitArgs
                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#handle","title":"Option Handling","text":"
                                                                                                                                                                                                                                                                                                                                                                                                  boolean handle(\n  String opt,\n  String value)\n

                                                                                                                                                                                                                                                                                                                                                                                                  handle throws an UnsupportedOperationException (and expects subclasses to override the default behaviour, e.g. SparkSubmitArguments).

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#-files","title":"--files

                                                                                                                                                                                                                                                                                                                                                                                                  A comma-separated sequence of paths

                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"tools/spark-submit/SparkSubmitUtils/","title":"SparkSubmitUtils","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  SparkSubmitUtils provides utilities for SparkSubmit.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"tools/spark-submit/SparkSubmitUtils/#getsubmitoperations","title":"getSubmitOperations
                                                                                                                                                                                                                                                                                                                                                                                                  getSubmitOperations(\n  master: String): SparkSubmitOperation\n

                                                                                                                                                                                                                                                                                                                                                                                                  getSubmitOperations...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                  getSubmitOperations\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                  • SparkSubmit is requested to kill a submission and requestStatus
                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/","title":"Web UIs","text":"

A web UI is the web interface of a Spark application or infrastructure for monitoring and inspection.

                                                                                                                                                                                                                                                                                                                                                                                                  The main abstraction is WebUI.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/AllJobsPage/","title":"AllJobsPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  AllJobsPage is a WebUIPage of JobsTab.

                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/AllJobsPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                  AllJobsPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                  • Parent JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusStore"},{"location":"webui/AllJobsPage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                    render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                    render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                    render renders a Spark Jobs page with the jobs and executors alongside applicationInfo and appSummary (from the AppStatusStore).
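
As a rough illustration only (AllJobsPage and the other web UI classes are Spark-internal and the real page builds considerably more content), a WebUIPage-style render could look like the sketch below; it assumes AppStatusStore.applicationInfo and UIUtils.headerSparkPage with the signatures shown, while the page class itself is made up:

import javax.servlet.http.HttpServletRequest
import scala.xml.Node
import org.apache.spark.status.AppStatusStore
import org.apache.spark.ui.{SparkUITab, UIUtils, WebUIPage}

// Hypothetical page, for illustration only (Spark's UI classes are private[spark]).
class SketchJobsPage(parent: SparkUITab, store: AppStatusStore) extends WebUIPage("") {
  override def render(request: HttpServletRequest): Seq[Node] = {
    val appInfo = store.applicationInfo()              // application-level metadata
    val summary = <div>Application: {appInfo.name}</div>
    // headerSparkPage wraps the content with the standard Spark UI header
    UIUtils.headerSparkPage(request, "Spark Jobs", summary, parent)
  }
}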

                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"webui/AllJobsPage/#introduction","title":"Introduction

                                                                                                                                                                                                                                                                                                                                                                                                    AllJobsPage renders a summary, an event timeline, and active, completed, and failed jobs of a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                    AllJobsPage displays the Summary section with the current Spark user, total uptime, scheduling mode, and the number of jobs per status.

                                                                                                                                                                                                                                                                                                                                                                                                    Under the summary section is the Event Timeline section.

                                                                                                                                                                                                                                                                                                                                                                                                    Active Jobs, Completed Jobs, and Failed Jobs sections follow.

Jobs are clickable (and give information about the stages and tasks inside them).

When you hover over a job in the Event Timeline, not only do you see the job legend, but the job is also highlighted in the Summary section.

                                                                                                                                                                                                                                                                                                                                                                                                    The Event Timeline section shows not only jobs but also executors.

                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"webui/AllStagesPage/","title":"AllStagesPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                    AllStagesPage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/AllStagesPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                    AllStagesPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                    • Parent StagesTab"},{"location":"webui/AllStagesPage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                      render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                      render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                      render renders a Stages for All Jobs page with the stages and application summary (from the AppStatusStore of the parent StagesTab).

                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/AllStagesPage/#stage-headers","title":"Stage Headers

                                                                                                                                                                                                                                                                                                                                                                                                      AllStagesPage uses the following headers and tooltips for the Stages table.

| Header | Tooltip |
|---|---|
| Stage Id | |
| Pool Name | |
| Description | |
| Submitted | |
| Duration | Elapsed time since the stage was submitted until execution completion of all its tasks. |
| Tasks: Succeeded/Total | |
| Input | Bytes read from Hadoop or from Spark storage. |
| Output | Bytes written to Hadoop. |
| Shuffle Read | Total shuffle bytes and records read (includes both data read locally and data read from remote executors). |
| Shuffle Write | Bytes and records written to disk in order to be read by a shuffle in a future stage. |
| Failure Reason | |
","text":""},{"location":"webui/EnvironmentPage/","title":"EnvironmentPage","text":""},{"location":"webui/EnvironmentPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] EnvironmentPage is a spark-webui-WebUIPage.md[WebUIPage] with an empty spark-webui-WebUIPage.md#prefix[prefix].

EnvironmentPage is created exclusively when EnvironmentTab is spark-webui-EnvironmentTab.md#creating-instance[created].

                                                                                                                                                                                                                                                                                                                                                                                                      == [[creating-instance]] Creating EnvironmentPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                      EnvironmentPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] Parent spark-webui-EnvironmentTab.md[EnvironmentTab]
                                                                                                                                                                                                                                                                                                                                                                                                      • [[conf]] SparkConf.md[SparkConf]
                                                                                                                                                                                                                                                                                                                                                                                                      • [[store]] core:AppStatusStore.md[]
                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/EnvironmentTab/","title":"EnvironmentTab","text":""},{"location":"webui/EnvironmentTab/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] EnvironmentTab is a spark-webui-SparkUITab.md[SparkUITab] with environment spark-webui-SparkUITab.md#prefix[prefix].

EnvironmentTab is created exclusively when SparkUI is spark-webui-SparkUI.md#initialize[initialized].

                                                                                                                                                                                                                                                                                                                                                                                                      [[creating-instance]] EnvironmentTab takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] Parent spark-webui-SparkUI.md[SparkUI]
                                                                                                                                                                                                                                                                                                                                                                                                      • [[store]] core:AppStatusStore.md[]

                                                                                                                                                                                                                                                                                                                                                                                                      When created, EnvironmentTab creates the spark-webui-EnvironmentPage.md#creating-instance[EnvironmentPage] page and spark-webui-WebUITab.md#attachPage[attaches] it immediately.
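
A minimal Scala sketch of what that amounts to, assuming the constructor parameters listed above (parent SparkUI, AppStatusStore) and the attachPage behaviour of WebUITab; this is a sketch consistent with the description, not a verbatim copy of Spark's source:

import org.apache.spark.status.AppStatusStore
import org.apache.spark.ui.{SparkUI, SparkUITab}

// A SparkUITab with the "environment" prefix that creates and attaches its
// single page as soon as the tab itself is created.
class EnvironmentTab(parent: SparkUI, store: AppStatusStore)
  extends SparkUITab(parent, "environment") {
  attachPage(new EnvironmentPage(this, parent.conf, store))
}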

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/ExecutorThreadDumpPage/","title":"ExecutorThreadDumpPage","text":""},{"location":"webui/ExecutorThreadDumpPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] ExecutorThreadDumpPage is a spark-webui-WebUIPage.md[WebUIPage] with threadDump spark-webui-WebUIPage.md#prefix[prefix].

ExecutorThreadDumpPage is created exclusively when ExecutorsTab is spark-webui-ExecutorsTab.md#creating-instance[created] (with spark.ui.threadDumpsEnabled configuration property enabled).

                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: spark.ui.threadDumpsEnabled configuration property is enabled (i.e. true) by default.

                                                                                                                                                                                                                                                                                                                                                                                                      === [[creating-instance]] Creating ExecutorThreadDumpPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorThreadDumpPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] spark-webui-SparkUITab.md[SparkUITab]
                                                                                                                                                                                                                                                                                                                                                                                                      • [[sc]] Optional SparkContext.md[]
                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/ExecutorsPage/","title":"ExecutorsPage","text":""},{"location":"webui/ExecutorsPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] ExecutorsPage is a spark-webui-WebUIPage.md[WebUIPage] with an empty spark-webui-WebUIPage.md#prefix[prefix].

ExecutorsPage is created exclusively when ExecutorsTab is spark-webui-ExecutorsTab.md#creating-instance[created].

                                                                                                                                                                                                                                                                                                                                                                                                      === [[creating-instance]] Creating ExecutorsPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorsPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] Parent spark-webui-SparkUITab.md[SparkUITab]
                                                                                                                                                                                                                                                                                                                                                                                                      • [[threadDumpEnabled]] threadDumpEnabled flag
                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/ExecutorsTab/","title":"ExecutorsTab","text":""},{"location":"webui/ExecutorsTab/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] ExecutorsTab is a spark-webui-SparkUITab.md[SparkUITab] with executors spark-webui-SparkUITab.md#prefix[prefix].

ExecutorsTab is created exclusively when SparkUI is spark-webui-SparkUI.md#initialize[initialized].

                                                                                                                                                                                                                                                                                                                                                                                                      [[creating-instance]] [[parent]] ExecutorsTab takes the parent spark-webui-SparkUI.md[SparkUI] when created.

When created, ExecutorsTab creates the following pages and spark-webui-WebUITab.md#attachPage[attaches] them immediately:

                                                                                                                                                                                                                                                                                                                                                                                                      • spark-webui-ExecutorsPage.md[ExecutorsPage]

                                                                                                                                                                                                                                                                                                                                                                                                      • spark-webui-ExecutorThreadDumpPage.md[ExecutorThreadDumpPage]

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/","title":"JettyUtils","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      == [[JettyUtils]] JettyUtils

JettyUtils is a set of utility methods for creating Jetty HTTP Server-specific components.

[[utility-methods]] .JettyUtils's Utility Methods [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Name | Description

| <<createServlet, createServlet>> | Creates an HttpServlet

| <<createStaticHandler, createStaticHandler>> | Creates a Handler for static content

| <<createServletHandler, createServletHandler>> | Creates a ServletContextHandler for a path

|===

                                                                                                                                                                                                                                                                                                                                                                                                      === [[createServletHandler]] Creating ServletContextHandler for Path -- createServletHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#source-scala","title":"[source, scala]","text":"

createServletHandler(
  path: String,
  servlet: HttpServlet,
  basePath: String): ServletContextHandler
createServletHandler[T <: AnyRef](...): ServletContextHandler // <1>

                                                                                                                                                                                                                                                                                                                                                                                                      <1> Uses the first three-argument createServletHandler

                                                                                                                                                                                                                                                                                                                                                                                                      createServletHandler...FIXME
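
While the exact wiring is left as a FIXME above, the general shape of mounting an HttpServlet under a context path with Jetty looks roughly like the following hand-rolled sketch (plain Jetty API, not Spark's JettyUtils; the function name is made up):

import javax.servlet.http.HttpServlet
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

// Mounts the given servlet at the given context path (e.g. "/jobs").
def sketchServletHandler(path: String, servlet: HttpServlet): ServletContextHandler = {
  val handler = new ServletContextHandler(ServletContextHandler.NO_SESSIONS)
  handler.setContextPath(path)
  handler.addServlet(new ServletHolder(servlet), "/")
  handler
}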

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      createServletHandler is used when:

                                                                                                                                                                                                                                                                                                                                                                                                      • WebUI is requested to spark-webui-WebUI.md#attachPage[attachPage]

                                                                                                                                                                                                                                                                                                                                                                                                      • MetricsServlet is requested to getHandlers

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#spark-standalones-workerwebui-is-requested-to-initialize","title":"* Spark Standalone's WorkerWebUI is requested to initialize","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      === [[createServlet]] Creating HttpServlet -- createServlet Method

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#source-scala_1","title":"[source, scala]","text":"

createServlet[T <: AnyRef](...): HttpServlet

createServlet creates the X-Frame-Options header, which is either ALLOW-FROM with the value of the spark-webui-properties.md#spark.ui.allowFramingFrom[spark.ui.allowFramingFrom] configuration property (if defined) or SAMEORIGIN.

createServlet creates a Java Servlet (HttpServlet) with support for GET requests.

                                                                                                                                                                                                                                                                                                                                                                                                      When handling GET requests, the HttpServlet first checks view permissions of the remote user (by requesting the SecurityManager to checkUIViewPermissions of the remote user).

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      Enable DEBUG logging level for org.apache.spark.SecurityManager logger to see what happens when SecurityManager does the security check.

                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.SecurityManager=DEBUG\n

                                                                                                                                                                                                                                                                                                                                                                                                      You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#debug-securitymanager-useruser-aclsenabledaclsenabled-viewaclsviewacls-viewaclsgroupsviewaclsgroups","title":"
                                                                                                                                                                                                                                                                                                                                                                                                      DEBUG SecurityManager: user=[user] aclsEnabled=[aclsEnabled] viewAcls=[viewAcls] viewAclsGroups=[viewAclsGroups]\n
                                                                                                                                                                                                                                                                                                                                                                                                      ","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      With view permissions check passed, the HttpServlet sends a response with the following:

                                                                                                                                                                                                                                                                                                                                                                                                      • FIXME

In case the view permissions do not allow viewing the page, the HttpServlet sends an error response with the following:

                                                                                                                                                                                                                                                                                                                                                                                                      • Status 403

                                                                                                                                                                                                                                                                                                                                                                                                      • Cache-Control header with \"no-cache, no-store, must-revalidate\"

                                                                                                                                                                                                                                                                                                                                                                                                      • Error message: \"User is not authorized to access this page.\"
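
The behaviour described above (the permission check, the X-Frame-Options header, and the 403 error response) can be sketched as a plain HttpServlet. This is a simplified stand-in, not Spark's createServlet; checkUIView stands in for SecurityManager.checkUIViewPermissions and the class name is made up:

import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

class SketchUIServlet(
    checkUIView: String => Boolean,       // stand-in for SecurityManager.checkUIViewPermissions
    allowFramingFrom: Option[String]      // value of spark.ui.allowFramingFrom, if any
  ) extends HttpServlet {

  private val xFrameOptions =
    allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN")

  override protected def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    if (checkUIView(req.getRemoteUser)) {
      resp.setHeader("X-Frame-Options", xFrameOptions)
      resp.setContentType("text/html;charset=utf-8")
      resp.getWriter.print("<html>...</html>")  // real page content goes here
    } else {
      resp.setHeader("Cache-Control", "no-cache, no-store, must-revalidate")
      resp.sendError(HttpServletResponse.SC_FORBIDDEN,
        "User is not authorized to access this page.")
    }
  }
}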

NOTE: createServlet is used exclusively when JettyUtils is requested to create a ServletContextHandler for a path (createServletHandler).

                                                                                                                                                                                                                                                                                                                                                                                                      === [[createStaticHandler]] Creating Handler For Static Content -- createStaticHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#source-scala_2","title":"[source, scala]","text":""},{"location":"webui/JettyUtils/#createstatichandlerresourcebase-string-path-string-servletcontexthandler","title":"createStaticHandler(resourceBase: String, path: String): ServletContextHandler","text":"

createStaticHandler creates a handler for serving files from a static directory.

                                                                                                                                                                                                                                                                                                                                                                                                      Internally, createStaticHandler creates a Jetty ServletContextHandler and sets org.eclipse.jetty.servlet.Default.gzip init parameter to false.

createStaticHandler creates a Jetty https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/servlet/DefaultServlet.html[DefaultServlet].

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#note_1","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      Quoting the official documentation of Jetty's https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/servlet/DefaultServlet.html[DefaultServlet]:

                                                                                                                                                                                                                                                                                                                                                                                                      DefaultServlet The default servlet. This servlet, normally mapped to /, provides the handling for static content, OPTION and TRACE methods for the context. The following initParameters are supported, these can be set either on the servlet itself or as ServletContext initParameters with a prefix of org.eclipse.jetty.servlet.Default.

With that, org.eclipse.jetty.servlet.Default.gzip configures the https://www.eclipse.org/jetty/documentation/current/advanced-extras.html#default-servlet-init[gzip] init parameter of Jetty's DefaultServlet.

                                                                                                                                                                                                                                                                                                                                                                                                      gzip If set to true, then static content will be served as gzip content encoded if a matching resource is found ending with \".gz\" (default false) (deprecated: use precompressed)

                                                                                                                                                                                                                                                                                                                                                                                                      ====

createStaticHandler resolves the resourceBase in the Spark classloader and, if successful, sets the resourceBase init parameter of the Jetty DefaultServlet to the resolved URL.

                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: https://www.eclipse.org/jetty/documentation/current/advanced-extras.html#default-servlet-init[resourceBase] init parameter is used to replace the context resource base.

createStaticHandler requests the ServletContextHandler to use the path as the context path and register the DefaultServlet to serve it.

createStaticHandler throws an Exception if the input resourceBase cannot be resolved:

                                                                                                                                                                                                                                                                                                                                                                                                      Could not find resource path for Web UI: [resourceBase]\n

                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: createStaticHandler is used when spark-webui-SparkUI.md#initialize[SparkUI], spark-history-server:HistoryServer.md#initialize[HistoryServer], Spark Standalone's MasterWebUI and WorkerWebUI, Spark on Mesos' MesosClusterUI are requested to initialize.
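Below is a minimal Scala sketch of a createStaticHandler-style static-content handler. It assumes Jetty's servlet classes on the classpath and mirrors the steps above (gzip init parameter, DefaultServlet, resourceBase resolution), but it is an illustration, not Spark's exact code.

import org.eclipse.jetty.servlet.{DefaultServlet, ServletContextHandler, ServletHolder}

def staticHandlerSketch(resourceBase: String, path: String): ServletContextHandler = {
  val contextHandler = new ServletContextHandler
  // serve static content uncompressed
  contextHandler.setInitParameter(\"org.eclipse.jetty.servlet.Default.gzip\", \"false\")
  val holder = new ServletHolder(new DefaultServlet)
  // resolve the resource directory on the classpath and use it as the servlet's resourceBase
  Option(Thread.currentThread.getContextClassLoader.getResource(resourceBase)) match {
    case Some(url) => holder.setInitParameter(\"resourceBase\", url.toString)
    case None => throw new Exception(s\"Could not find resource path for Web UI: $resourceBase\")
  }
  contextHandler.setContextPath(path)
  contextHandler.addServlet(holder, \"/\")
  contextHandler
}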

                                                                                                                                                                                                                                                                                                                                                                                                      === [[createRedirectHandler]] createRedirectHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JettyUtils/#source-scala_3","title":"[source, scala]","text":"

createRedirectHandler(\n  srcPath: String,\n  destPath: String,\n  beforeRedirect: HttpServletRequest => Unit = x => (),\n  basePath: String = \"\",\n  httpMethods: Set[String] = Set(\"GET\")): ServletContextHandler\n

createRedirectHandler creates a ServletContextHandler for srcPath whose servlet redirects requests of the allowed httpMethods to destPath (prefixed with basePath), invoking beforeRedirect before sending the redirect.

                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: createRedirectHandler is used when spark-webui-SparkUI.md#initialize[SparkUI] and Spark Standalone's MasterWebUI are requested to initialize.
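A simplified sketch of such a redirect handler follows; it assumes Jetty and the javax.servlet API, handles GET only, and skips basePath, httpMethods and beforeRedirect handling, so it illustrates the idea rather than Spark's implementation.

import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

def redirectHandlerSketch(srcPath: String, destPath: String): ServletContextHandler = {
  val servlet = new HttpServlet {
    // redirect every GET request on srcPath to destPath
    override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit =
      res.sendRedirect(destPath)
  }
  val handler = new ServletContextHandler
  handler.setContextPath(srcPath)
  handler.addServlet(new ServletHolder(servlet), \"/\")
  handler
}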

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JobPage/","title":"JobPage","text":""},{"location":"webui/JobPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] JobPage is a spark-webui-WebUIPage.md[WebUIPage] with job spark-webui-WebUIPage.md#prefix[prefix].

JobPage is created exclusively when JobsTab is created.

                                                                                                                                                                                                                                                                                                                                                                                                      === [[creating-instance]] Creating JobPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                      JobPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] Parent JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                      • [[store]] core:AppStatusStore.md[]
                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JobsTab/","title":"JobsTab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      JobsTab is a SparkUITab with jobs URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/JobsTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                      JobsTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                      • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                      • AppStatusStore

                                                                                                                                                                                                                                                                                                                                                                                                        JobsTab is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/JobsTab/#pages","title":"Pages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                        When created, JobsTab attaches the following pages (with a reference to itself and the AppStatusStore):

                                                                                                                                                                                                                                                                                                                                                                                                        • AllJobsPage
                                                                                                                                                                                                                                                                                                                                                                                                        • JobPage
                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/JobsTab/#event-timeline","title":"Event Timeline","text":""},{"location":"webui/JobsTab/#details-for-job","title":"Details for Job","text":"

Clicking a job in AllJobsPage leads to the Details for Job page.

When a job id is not found, you should see the \"No information to display for job ID\" message.

                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/PoolPage/","title":"PoolPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                        PoolPage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/PoolPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                        PoolPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                        • Parent StagesTab"},{"location":"webui/PoolPage/#url-prefix","title":"URL Prefix

                                                                                                                                                                                                                                                                                                                                                                                                          PoolPage uses pool URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/PoolPage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                          render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                          render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                          render requires poolname and attempt request parameters.

                                                                                                                                                                                                                                                                                                                                                                                                          render renders a Fair Scheduler Pool page with the PoolData (from the AppStatusStore of the parent StagesTab).

                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/PoolPage/#introduction","title":"Introduction

                                                                                                                                                                                                                                                                                                                                                                                                          The Fair Scheduler Pool Details page shows information about a Schedulable pool and is only available when a Spark application uses the FAIR scheduling mode.
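As a usage sketch (the application name, master URL and pool name below are illustrative), enabling the FAIR scheduling mode and running work in a named pool looks roughly like this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName(\"fair-pools-demo\")        // illustrative application name
  .setMaster(\"local[*]\")
  .set(\"spark.scheduler.mode\", \"FAIR\")  // switch from the default FIFO mode
val sc = SparkContext.getOrCreate(conf)

// all jobs submitted from this thread go to the \"production\" pool
sc.setLocalProperty(\"spark.scheduler.pool\", \"production\")
sc.parallelize(1 to 100).count()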

                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/PoolPage/#summary-table","title":"Summary Table","text":"

                                                                                                                                                                                                                                                                                                                                                                                                          The Summary table shows the details of a Schedulable pool.

                                                                                                                                                                                                                                                                                                                                                                                                          It uses the following columns:

                                                                                                                                                                                                                                                                                                                                                                                                          • Pool Name
                                                                                                                                                                                                                                                                                                                                                                                                          • Minimum Share
                                                                                                                                                                                                                                                                                                                                                                                                          • Pool Weight
                                                                                                                                                                                                                                                                                                                                                                                                          • Active Stages (the number of the active stages in a Schedulable pool)
                                                                                                                                                                                                                                                                                                                                                                                                          • Running Tasks
                                                                                                                                                                                                                                                                                                                                                                                                          • SchedulingMode
                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/PoolPage/#active-stages-table","title":"Active Stages Table","text":"

                                                                                                                                                                                                                                                                                                                                                                                                          The Active Stages table shows the active stages in a pool.

                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/PrometheusResource/","title":"PrometheusResource","text":""},{"location":"webui/PrometheusResource/#getservlethandler","title":"getServletHandler
                                                                                                                                                                                                                                                                                                                                                                                                          getServletHandler(\n  uiRoot: UIRoot): ServletContextHandler\n

                                                                                                                                                                                                                                                                                                                                                                                                          getServletHandler...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                          getServletHandler\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                          • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/RDDPage/","title":"RDDPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                          == [[RDDPage]] RDDPage

                                                                                                                                                                                                                                                                                                                                                                                                          [[prefix]] RDDPage is a spark-webui-WebUIPage.md[WebUIPage] with rdd spark-webui-WebUIPage.md#prefix[prefix].

RDDPage is created exclusively when StorageTab is spark-webui-StorageTab.md#creating-instance[created].

                                                                                                                                                                                                                                                                                                                                                                                                          [[creating-instance]] RDDPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                          • [[parent]] Parent spark-webui-SparkUITab.md[SparkUITab]
                                                                                                                                                                                                                                                                                                                                                                                                          • [[store]] core:AppStatusStore.md[]

                                                                                                                                                                                                                                                                                                                                                                                                          === [[render]] render Method

                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/RDDPage/#source-scala","title":"[source, scala]","text":""},{"location":"webui/RDDPage/#renderrequest-httpservletrequest-seqnode","title":"render(request: HttpServletRequest): Seq[Node]","text":"

NOTE: render is part of the spark-webui-WebUIPage.md#render[WebUIPage Contract].

                                                                                                                                                                                                                                                                                                                                                                                                          render...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUI/","title":"SparkUI","text":"

                                                                                                                                                                                                                                                                                                                                                                                                          SparkUI is a WebUI of Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUI/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                          SparkUI takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                          • AppStatusStore
                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                          • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                          • Application Name
                                                                                                                                                                                                                                                                                                                                                                                                          • Base Path
                                                                                                                                                                                                                                                                                                                                                                                                          • Start Time
                                                                                                                                                                                                                                                                                                                                                                                                          • Spark Version

                                                                                                                                                                                                                                                                                                                                                                                                            While being created, SparkUI initializes itself.

                                                                                                                                                                                                                                                                                                                                                                                                            SparkUI is created using create utility.

                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/SparkUI/#ui-port","title":"UI Port
                                                                                                                                                                                                                                                                                                                                                                                                            getUIPort(\n  conf: SparkConf): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                            getUIPort requests the SparkConf for the value of spark.ui.port configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                            getUIPort\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                            • SparkUI is created
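As a usage sketch (plain SparkConf, not SparkUI internals), the port can be set and read back the way getUIPort does:

import org.apache.spark.SparkConf

val conf = new SparkConf().set(\"spark.ui.port\", \"4041\")  // default is 4040
val uiPort = conf.getInt(\"spark.ui.port\", 4040)           // getUIPort-style lookup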
                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/SparkUI/#creating-sparkui","title":"Creating SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                            create(\n  sc: Option[SparkContext],\n  store: AppStatusStore,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  appName: String,\n  basePath: String,\n  startTime: Long,\n  appSparkVersion: String): SparkUI\n

                                                                                                                                                                                                                                                                                                                                                                                                            create creates a new SparkUI with appSparkVersion being the current Spark version.

                                                                                                                                                                                                                                                                                                                                                                                                            create\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is created (with the spark.ui.enabled configuration property turned on)
                                                                                                                                                                                                                                                                                                                                                                                                            • FsHistoryProvider (Spark History Server) is requested for the web UI of a Spark application
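In a live application you rarely call create yourself; assuming sc is an active SparkContext, the resulting UI address can be looked up as follows:

// None when the web UI is disabled (spark.ui.enabled=false)
val uiUrl: Option[String] = sc.uiWebUrl  // e.g. Some(\"http://driver-host:4040\")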
                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/SparkUI/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                            initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                            initialize\u00a0is part of the WebUI abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                            initialize creates and attaches the following tabs:

                                                                                                                                                                                                                                                                                                                                                                                                            1. JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                            2. StagesTab
                                                                                                                                                                                                                                                                                                                                                                                                            3. StorageTab
                                                                                                                                                                                                                                                                                                                                                                                                            4. EnvironmentTab
                                                                                                                                                                                                                                                                                                                                                                                                            5. ExecutorsTab

                                                                                                                                                                                                                                                                                                                                                                                                            initialize attaches itself as the UIRoot.

                                                                                                                                                                                                                                                                                                                                                                                                            initialize attaches the PrometheusResource for executor metrics based on spark.ui.prometheus.enabled configuration property.
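A minimal sketch of enabling it (the endpoint path below is what recent Spark releases expose; verify against your version):

import org.apache.spark.SparkConf

val conf = new SparkConf().set(\"spark.ui.prometheus.enabled\", \"true\")
// executor metrics are then scrapeable at:
//   http://<driver-host>:4040/metrics/executors/prometheus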

                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/SparkUI/#uiroot","title":"UIRoot

SparkUI is a UIRoot.

                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/SparkUI/#review-me","title":"Review Me

SparkUI is created when:

                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is created (for a live Spark application with spark-webui-properties.md#spark.ui.enabled[spark.ui.enabled] configuration property enabled)

                                                                                                                                                                                                                                                                                                                                                                                                            • FsHistoryProvider is requested for the spark-history-server:FsHistoryProvider.md#getAppUI[application UI] (for a live or completed Spark application)

                                                                                                                                                                                                                                                                                                                                                                                                            .Creating SparkUI for Live Spark Application image::spark-webui-SparkUI.png[align=\"center\"]

When created (while SparkContext is created for a live Spark application), SparkUI gets the following:

• Live AppStatusStore (with an ElementTrackingStore using an core:InMemoryStore.md[] and an AppStatusListener for a live Spark application)

• Name of the Spark application that is exactly the value of the SparkConf.md#spark.app.name[spark.app.name] configuration property

                                                                                                                                                                                                                                                                                                                                                                                                            • Empty base path

When started, SparkUI binds to the web UI address that you can control using the SPARK_PUBLIC_DNS environment variable or the spark-driver.md#spark_driver_host[spark.driver.host] Spark property.
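A hedged configuration sketch (the hostname is a placeholder; the exact precedence between the environment variable and the property depends on your deployment):

// either export SPARK_PUBLIC_DNS=driver.example.com before launching the driver, or:
val conf = new org.apache.spark.SparkConf().set(\"spark.driver.host\", \"driver.example.com\")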

NOTE: With spark-webui-properties.md#spark.ui.killEnabled[spark.ui.killEnabled] configuration property turned on, SparkUI allows killing jobs and stages (subject to SecurityManager.checkModifyPermissions permissions).

SparkUI gets an AppStatusStore that is then used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                            • <>, i.e. JobsTab.md#creating-instance[JobsTab], spark-webui-StagesTab.md#creating-instance[StagesTab], spark-webui-StorageTab.md#creating-instance[StorageTab], spark-webui-EnvironmentTab.md#creating-instance[EnvironmentTab]

                                                                                                                                                                                                                                                                                                                                                                                                            • AbstractApplicationResource is requested for spark-api-AbstractApplicationResource.md#jobsList[jobsList], spark-api-AbstractApplicationResource.md#oneJob[oneJob], spark-api-AbstractApplicationResource.md#executorList[executorList], spark-api-AbstractApplicationResource.md#allExecutorList[allExecutorList], spark-api-AbstractApplicationResource.md#rddList[rddList], spark-api-AbstractApplicationResource.md#rddData[rddData], spark-api-AbstractApplicationResource.md#environmentInfo[environmentInfo]

                                                                                                                                                                                                                                                                                                                                                                                                            • StagesResource is requested for spark-api-StagesResource.md#stageList[stageList], spark-api-StagesResource.md#stageData[stageData], spark-api-StagesResource.md#oneAttemptData[oneAttemptData], spark-api-StagesResource.md#taskSummary[taskSummary], spark-api-StagesResource.md#taskList[taskList]

                                                                                                                                                                                                                                                                                                                                                                                                            • SparkUI is requested for the current <>

                                                                                                                                                                                                                                                                                                                                                                                                            • Creating Spark SQL's SQLTab (when SQLHistoryServerPlugin is requested to setupUI)

                                                                                                                                                                                                                                                                                                                                                                                                            • Spark Streaming's BatchPage is created
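As a hedged sketch of the same idea, the public SparkStatusTracker exposes a subset of the live data that the AppStatusStore serves to the web UI tabs and REST resources listed above (sc is assumed to be an active SparkContext):

[source, scala]
----
// sc: org.apache.spark.SparkContext (assumed to exist)
val tracker = sc.statusTracker
tracker.getActiveJobIds().foreach { jobId =>
  tracker.getJobInfo(jobId).foreach { info =>
    // The same live job data backs the Jobs tab and the /api/v1 jobs endpoints.
    println(s"Job $jobId: ${info.status} (stages: ${info.stageIds.mkString(", ")})")
  }
}
----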

[[internal-registries]]
.SparkUI's Internal Properties (e.g. Registries, Counters and Flags)
[cols="1,2",options="header",width="100%"]
|===
| Name | Description

| appId
| [[appId]]
|===

[TIP]
====
Enable INFO logging level for org.apache.spark.ui.SparkUI logger to see what happens inside.

Add the following line to conf/log4j.properties:

[source]
----
log4j.logger.org.apache.spark.ui.SparkUI=INFO
----

Refer to spark-logging.md[Logging].
====

                                                                                                                                                                                                                                                                                                                                                                                                              == [[setAppId]] Assigning Unique Identifier of Spark Application -- setAppId Method

                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/SparkUI/#source-scala","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#setappidid-string-unit","title":"setAppId(id: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                              setAppId sets the internal <>.

                                                                                                                                                                                                                                                                                                                                                                                                              setAppId is used when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                              == [[stop]] Stopping SparkUI -- stop Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_1","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#stop-unit","title":"stop(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                              stop stops the HTTP server and prints the following INFO message to the logs:

INFO SparkUI: Stopped Spark web UI at [appUIAddress]

                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: appUIAddress in the above INFO message is the result of <> method.

                                                                                                                                                                                                                                                                                                                                                                                                              == [[appUIAddress]] appUIAddress Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_2","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#appuiaddress-string","title":"appUIAddress: String

appUIAddress returns the entire URL of a Spark application's web UI, including the http:// scheme.

                                                                                                                                                                                                                                                                                                                                                                                                              Internally, appUIAddress uses <>.
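For reference, a live application can read this address through SparkContext.uiWebUrl. A minimal sketch (sc is assumed to be an active SparkContext):

[source, scala]
----
// sc: org.apache.spark.SparkContext (assumed to exist)
sc.uiWebUrl match {
  case Some(url) => println(s"Spark web UI available at $url")    // e.g. http://host:4040
  case None      => println("Web UI disabled (spark.ui.enabled=false)")
}
----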

                                                                                                                                                                                                                                                                                                                                                                                                              == [[createLiveUI]] createLiveUI Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_3","title":"[source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                              createLiveUI( sc: SparkContext, conf: SparkConf, listenerBus: SparkListenerBus, jobProgressListener: JobProgressListener, securityManager: SecurityManager, appName: String, startTime: Long): SparkUI

                                                                                                                                                                                                                                                                                                                                                                                                              createLiveUI creates a SparkUI for a live running Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                              Internally, createLiveUI simply forwards the call to <>.

                                                                                                                                                                                                                                                                                                                                                                                                              createLiveUI is used when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                              == [[createHistoryUI]] createHistoryUI Method

                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                              == [[appUIHostPort]] appUIHostPort Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_4","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#appuihostport-string","title":"appUIHostPort: String

appUIHostPort returns the public hostname and port of the Spark application's web UI (excluding the scheme).

NOTE: <> uses appUIHostPort and adds the http:// scheme.

                                                                                                                                                                                                                                                                                                                                                                                                              == [[getAppName]] getAppName Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_5","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#getappname-string","title":"getAppName: String

                                                                                                                                                                                                                                                                                                                                                                                                              getAppName returns the name of the Spark application (of a SparkUI instance).

                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: getAppName is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                              == [[create]] Creating SparkUI Instance -- create Factory Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_6","title":"[source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                              create( sc: Option[SparkContext], store: AppStatusStore, conf: SparkConf, securityManager: SecurityManager, appName: String, basePath: String = \"\", startTime: Long, appSparkVersion: String = org.apache.spark.SPARK_VERSION): SparkUI

                                                                                                                                                                                                                                                                                                                                                                                                              create creates a SparkUI backed by a core:AppStatusStore.md[].

                                                                                                                                                                                                                                                                                                                                                                                                              Internally, create simply creates a new <> (with the predefined Spark version).

                                                                                                                                                                                                                                                                                                                                                                                                              create is used when:

                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                              • FsHistoryProvider is requested to spark-history-server:FsHistoryProvider.md#getAppUI[getAppUI] (for a Spark application that already finished)
                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#creating-instance_1","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                              SparkUI takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                              • [[store]] core:AppStatusStore.md[]
                                                                                                                                                                                                                                                                                                                                                                                                              • [[sc]] SparkContext.md[]
                                                                                                                                                                                                                                                                                                                                                                                                              • [[conf]] SparkConf.md[SparkConf]
                                                                                                                                                                                                                                                                                                                                                                                                              • [[securityManager]] SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                              • [[appName]] Application name
                                                                                                                                                                                                                                                                                                                                                                                                              • [[basePath]] basePath
                                                                                                                                                                                                                                                                                                                                                                                                              • [[startTime]] Start time
                                                                                                                                                                                                                                                                                                                                                                                                              • [[appSparkVersion]] appSparkVersion

                                                                                                                                                                                                                                                                                                                                                                                                              SparkUI initializes the <> and <>.

                                                                                                                                                                                                                                                                                                                                                                                                              == [[initialize]] Attaching Tabs and Context Handlers -- initialize Method

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUI/#source-scala_7","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#initialize-unit","title":"initialize(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: initialize is part of spark-webui-WebUI.md#initialize[WebUI Contract] to initialize web components.

                                                                                                                                                                                                                                                                                                                                                                                                              initialize creates and <> the following tabs (with the reference to the SparkUI and its <>):

. spark-webui-StagesTab.md[StagesTab]
. spark-webui-StorageTab.md[StorageTab]
. spark-webui-EnvironmentTab.md[EnvironmentTab]
. spark-webui-ExecutorsTab.md[ExecutorsTab]

                                                                                                                                                                                                                                                                                                                                                                                                              In the end, initialize creates and spark-webui-WebUI.md#attachHandler[attaches] the following ServletContextHandlers:

. spark-webui-JettyUtils.md#createStaticHandler[Creates a static handler] at /static to serve static files from the org/apache/spark/ui/static directory (on CLASSPATH)

. spark-api-ApiRootResource.md#getServletHandler[Creates the /api/* context handler] for the spark-api.md[Status REST API] (see the sketch after this list)

                                                                                                                                                                                                                                                                                                                                                                                                              . spark-webui-JettyUtils.md#createRedirectHandler[Creates a redirect handler] to redirect /jobs/job/kill to /jobs/ and request the JobsTab to execute handleKillRequest before redirection

                                                                                                                                                                                                                                                                                                                                                                                                              . spark-webui-JettyUtils.md#createRedirectHandler[Creates a redirect handler] to redirect /stages/stage/kill to /stages/ and request the StagesTab to execute spark-webui-StagesTab.md#handleKillRequest[handleKillRequest] before redirection
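The sketch below shows the /api/* context handler in action by fetching the applications resource of the Status REST API. It assumes a live application with the web UI on the default address http://localhost:4040.

[source, scala]
----
import scala.io.Source

// Assumes a live application with the web UI on the default port 4040.
val json = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(json) // JSON array with the application id, name and attempt details
----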

                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/SparkUITab/","title":"SparkUITab","text":"

SparkUITab is an extension of the WebUITab abstraction for UI tabs with the application name and Spark version.

                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/SparkUITab/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                              • EnvironmentTab
                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorsTab
                                                                                                                                                                                                                                                                                                                                                                                                              • JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                              • StagesTab
                                                                                                                                                                                                                                                                                                                                                                                                              • StorageTab
                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/SparkUITab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                              SparkUITab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                              • Parent SparkUI
• URL Prefix

## Abstract Class

SparkUITab is an abstract class and cannot be created directly. It is created indirectly for the concrete SparkUITabs.

                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/SparkUITab/#application-name","title":"Application Name
```scala
appName: String
```

                                                                                                                                                                                                                                                                                                                                                                                                                appName requests the parent SparkUI for the appName.

                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"webui/SparkUITab/#spark-version","title":"Spark Version
```scala
appSparkVersion: String
```

                                                                                                                                                                                                                                                                                                                                                                                                                appSparkVersion requests the parent SparkUI for the appSparkVersion.

                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"webui/StagePage/","title":"StagePage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                StagePage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/StagePage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                StagePage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                • Parent StagesTab
• AppStatusStore

## URL Prefix

StagePage uses the stage URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StagePage/#rendering-page","title":"Rendering Page
```scala
render(
  request: HttpServletRequest): Seq[Node]
```

render is part of the WebUIPage abstraction.

render requires the id and attempt request parameters (e.g. /stages/stage/?id=1&attempt=0 in a live web UI).

                                                                                                                                                                                                                                                                                                                                                                                                                  render...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StagePage/#tasks-section","title":"Tasks Section","text":""},{"location":"webui/StagePage/#summary-metrics-for-completed-tasks-in-stage","title":"Summary Metrics for Completed Tasks in Stage

                                                                                                                                                                                                                                                                                                                                                                                                                  The summary metrics table shows the metrics for the tasks in a given stage that have already finished with SUCCESS status and metrics available.

                                                                                                                                                                                                                                                                                                                                                                                                                  The 1st row is Duration which includes the quantiles based on executorRunTime.

The 2nd row is the optional Scheduler Delay, which includes the time to ship the task from the scheduler to executors, and the time to send the task result from the executors to the scheduler. It is not enabled by default; select the Scheduler Delay checkbox under Show Additional Metrics to include it in the summary table.

The 3rd row is the optional Task Deserialization Time, which includes the quantiles based on the executorDeserializeTime task metric. It is not enabled by default; select the Task Deserialization Time checkbox under Show Additional Metrics to include it in the summary table.

The 4th row is GC Time, which is the time that an executor spent paused for Java garbage collection while the task was running (using the jvmGCTime task metric).

The 5th row is the optional Result Serialization Time, which is the time spent serializing the task result on an executor before sending it back to the driver (using the resultSerializationTime task metric). It is not enabled by default; select the Result Serialization Time checkbox under Show Additional Metrics to include it in the summary table.

The 6th row is the optional Getting Result Time, which is the time that the driver spends fetching task results from workers. It is not enabled by default; select the Getting Result Time checkbox under Show Additional Metrics to include it in the summary table.

The 7th row is the optional Peak Execution Memory, which is the sum of the peak sizes of the internal data structures created during shuffles, aggregations and joins (using the peakExecutionMemory task metric).

If the stage has an input, the 8th row is Input Size / Records, which is the bytes and records read from Hadoop or from Spark storage (using the inputMetrics.bytesRead and inputMetrics.recordsRead task metrics).

If the stage has an output, the 9th row is Output Size / Records, which is the bytes and records written to Hadoop or to Spark storage (using the outputMetrics.bytesWritten and outputMetrics.recordsWritten task metrics).

If the stage has shuffle read, there will be three more rows in the table. The first row is Shuffle Read Blocked Time, which is the time that tasks spent blocked waiting for shuffle data to be read from remote machines (using the shuffleReadMetrics.fetchWaitTime task metric). The second row is Shuffle Read Size / Records, which is the total shuffle bytes and records read, including both data read locally and data read from remote executors (using the shuffleReadMetrics.totalBytesRead and shuffleReadMetrics.recordsRead task metrics). The last row is Shuffle Remote Reads, which is the total shuffle bytes read from remote executors, a subset of the shuffle read bytes since the remaining shuffle data is read locally (using the shuffleReadMetrics.remoteBytesRead task metric).

                                                                                                                                                                                                                                                                                                                                                                                                                  If the stage has shuffle write, the following row is Shuffle Write Size / Records (using shuffleWriteMetrics.bytesWritten and shuffleWriteMetrics.recordsWritten task metrics).

                                                                                                                                                                                                                                                                                                                                                                                                                  If the stage has bytes spilled, the following two rows are Shuffle spill (memory) (using memoryBytesSpilled task metric) and Shuffle spill (disk) (using diskBytesSpilled task metric).

                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StagePage/#dag-visualization","title":"DAG Visualization","text":""},{"location":"webui/StagePage/#event-timeline","title":"Event Timeline","text":""},{"location":"webui/StagePage/#stage-task-and-shuffle-stats","title":"Stage Task and Shuffle Stats","text":""},{"location":"webui/StagePage/#aggregated-metrics-by-executor","title":"Aggregated Metrics by Executor

The ExecutorTable shows the following columns:

                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                  • Address
                                                                                                                                                                                                                                                                                                                                                                                                                  • Task Time
                                                                                                                                                                                                                                                                                                                                                                                                                  • Total Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                  • Failed Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                  • Killed Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                  • Succeeded Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Input Size / Records (only when the stage has an input)
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Output Size / Records (only when the stage has an output)
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Shuffle Read Size / Records (only when the stage read bytes for a shuffle)
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Shuffle Write Size / Records (only when the stage wrote bytes for a shuffle)
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Shuffle Spill (Memory) (only when the stage spilled memory bytes)
                                                                                                                                                                                                                                                                                                                                                                                                                  • (optional) Shuffle Spill (Disk) (only when the stage spilled bytes to disk)

It gets executorSummary from StageUIData (for the stage and stage attempt ID) and creates one row per executor.

                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StagePage/#accumulators","title":"Accumulators

The Stage page displays a table with named accumulators (only if any exist). It contains the name and value of each accumulator.
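
The following is a minimal sketch (not from the original text) of registering a named accumulator with the public SparkContext API; because the accumulator has a name, it shows up in this table. The application and accumulator names are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("accumulator-demo").setMaster("local[*]"))

// A *named* accumulator is what the Accumulators table displays
val errorCount = sc.longAccumulator("error count")

sc.parallelize(1 to 100).foreach(n => if (n % 10 == 0) errorCount.add(1))
// The stage page now lists "error count" together with its accumulated value
```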

                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StagesTab/","title":"StagesTab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                  StagesTab is a SparkUITab with stages URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/StagesTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                  StagesTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                  • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusStore

StagesTab is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/StagesTab/#pages","title":"Pages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                    When created, StagesTab attaches the following pages:

                                                                                                                                                                                                                                                                                                                                                                                                                    • AllStagesPage
                                                                                                                                                                                                                                                                                                                                                                                                                    • StagePage (with the AppStatusStore)
                                                                                                                                                                                                                                                                                                                                                                                                                    • PoolPage
                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/StagesTab/#introduction","title":"Introduction","text":"

The Stages tab shows the current state of all stages of all jobs in a Spark application. It comes with two optional pages: one with the tasks and statistics of a stage (when a stage is selected) and one with pool details (when the application runs in FAIR scheduling mode).

                                                                                                                                                                                                                                                                                                                                                                                                                    The title of the tab is Stages for All Jobs.

                                                                                                                                                                                                                                                                                                                                                                                                                    With no jobs submitted yet (and hence no stages to display), the page shows nothing but the title.

                                                                                                                                                                                                                                                                                                                                                                                                                    The Stages page shows the stages in a Spark application per state in their respective sections:

                                                                                                                                                                                                                                                                                                                                                                                                                    • Active Stages
                                                                                                                                                                                                                                                                                                                                                                                                                    • Pending Stages
                                                                                                                                                                                                                                                                                                                                                                                                                    • Completed Stages
                                                                                                                                                                                                                                                                                                                                                                                                                    • Failed Stages

                                                                                                                                                                                                                                                                                                                                                                                                                    The state sections are only displayed when there are stages in a given state.

                                                                                                                                                                                                                                                                                                                                                                                                                    In FAIR scheduling mode you have access to the table showing the scheduler pools.
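
As a hedged illustration (not part of the original text), the sketch below enables the FAIR scheduler and assigns jobs from the current thread to a pool, which is when the pools table shows up; the pool name "reports" is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pools-demo")
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")   // switch from the default FIFO to FAIR scheduling
val sc = new SparkContext(conf)

// Jobs submitted from this thread run in the (made-up) "reports" pool
sc.setLocalProperty("spark.scheduler.pool", "reports")
sc.parallelize(1 to 1000).count()
// The Stages tab now shows the scheduler pools table
```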

                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/StoragePage/","title":"StoragePage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                    StoragePage is a WebUIPage of StorageTab.

                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/StoragePage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                    StoragePage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                    • Parent SparkUITab
                                                                                                                                                                                                                                                                                                                                                                                                                    • AppStatusStore

StoragePage is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                      • StorageTab is created
                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/StoragePage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                                      render(\n  request: HttpServletRequest): Seq[Node]\n

render is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                      render renders a Storage page with the RDDs and streaming blocks (from the AppStatusStore).
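
As an illustration (not from the original text; the RDD name is made up), persisting and materializing an RDD with the public API is what makes it appear among the RDDs rendered on this page:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("storage-page-demo").setMaster("local[*]"))

val rdd = sc.parallelize(1 to 1000000)
  .setName("demo numbers")                  // shown in the RDD Name column
  .persist(StorageLevel.MEMORY_AND_DISK)    // shown in the Storage Level column

rdd.count()   // materializes (and caches) the partitions, so the RDD gets listed
```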

                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/StoragePage/#rdd-tables-headers","title":"RDD Table's Headers

                                                                                                                                                                                                                                                                                                                                                                                                                      StoragePage uses the following headers and tooltips for the RDD table.

| Header | Tooltip |
|--------|---------|
| ID | |
| RDD Name | Name of the persisted RDD |
| Storage Level | StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) and replication factor of the persisted RDD |
| Cached Partitions | Number of partitions cached |
| Fraction Cached | Fraction of total partitions cached |
| Size in Memory | Total size of partitions in memory |
| Size on Disk | Total size of partitions on the disk |

StorageTab

                                                                                                                                                                                                                                                                                                                                                                                                                      StorageTab is a SparkUITab with storage URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/StorageTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                      StorageTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                      • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                      • AppStatusStore

StorageTab is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/StorageTab/#pages","title":"Pages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                        When created, StorageTab attaches the following pages (with a reference to itself and the AppStatusStore):

                                                                                                                                                                                                                                                                                                                                                                                                                        • StoragePage
                                                                                                                                                                                                                                                                                                                                                                                                                        • RDDPage
                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/UIUtils/","title":"UIUtils","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                        == [[UIUtils]] UIUtils

                                                                                                                                                                                                                                                                                                                                                                                                                        UIUtils is a utility object for...FIXME

headerSparkPage Method

headerSparkPage(
  request: HttpServletRequest,
  title: String,
  content: => Seq[Node],
  activeTab: SparkUITab,
  refreshInterval: Option[Int] = None,
  helpText: Option[String] = None,
  showVisualization: Boolean = false,
  useDataTables: Boolean = false): Seq[Node]

headerSparkPage wraps the given content in the common Spark UI page layout: the header with the navigation bar of the active tab's parent UI, the page title, and the optional help text.

NOTE: headerSparkPage is used when WebUIPages render themselves (to wrap their content in the common page layout).
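
A hedged sketch of how a page could use headerSparkPage follows. These are package-private org.apache.spark.ui APIs (so the code only compiles inside that package), and MyPage and its title are made up.

```scala
import javax.servlet.http.HttpServletRequest
import scala.xml.Node
import org.apache.spark.ui.{SparkUITab, UIUtils, WebUIPage}

// A made-up page that wraps its content in the standard Spark UI page layout
class MyPage(parent: SparkUITab) extends WebUIPage("") {
  override def render(request: HttpServletRequest): Seq[Node] = {
    val content = <p>Hello from MyPage</p>
    UIUtils.headerSparkPage(request, "My Page", content, parent)
  }
}
```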

                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUI/","title":"WebUI","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                        WebUI is an abstraction of UIs.

                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUI/#contract","title":"Contract","text":""},{"location":"webui/WebUI/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                        initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                        Initializes components of the UI

                                                                                                                                                                                                                                                                                                                                                                                                                        Used by the implementations themselves.

                                                                                                                                                                                                                                                                                                                                                                                                                        Note

initialize does not add anything special to the Scala type hierarchy; it is merely a common name to use across WebUIs. In other words, initialize does not participate in any design pattern or type hierarchy and is part of the contract in name only.

                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/WebUI/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                        • HistoryServer
                                                                                                                                                                                                                                                                                                                                                                                                                        • MasterWebUI (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                                                                                                        • MesosClusterUI (Spark on Mesos)
                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                        • WorkerWebUI (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUI/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                        WebUI takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                        • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                        • SSLOptions
                                                                                                                                                                                                                                                                                                                                                                                                                        • Port
                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                        • Base Path (default: empty)
• Name (default: empty)

Abstract Class

WebUI is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUIs.

                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/WebUI/#tabs","title":"Tabs

                                                                                                                                                                                                                                                                                                                                                                                                                          WebUI uses tabs registry for WebUITabs (that have been attached).

                                                                                                                                                                                                                                                                                                                                                                                                                          Tabs can be attached and detached.

                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#attaching-tab","title":"Attaching Tab
                                                                                                                                                                                                                                                                                                                                                                                                                          attachTab(\n  tab: WebUITab): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                          attachTab attaches the pages of the given WebUITab (and adds it to the tabs).
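
A hedged sketch of that wiring (again package-private org.apache.spark.ui APIs; DemoTab, DemoPage and the register helper are made up):

```scala
import javax.servlet.http.HttpServletRequest
import scala.xml.Node
import org.apache.spark.ui.{SparkUI, SparkUITab, WebUIPage}

class DemoPage extends WebUIPage("") {
  override def render(request: HttpServletRequest): Seq[Node] = <p>Demo</p>
}

// A made-up tab (URL prefix "demo") that registers its single page
class DemoTab(parent: SparkUI) extends SparkUITab(parent, "demo") {
  attachPage(new DemoPage)
}

def register(ui: SparkUI): Unit = {
  val tab = new DemoTab(ui)
  ui.attachTab(tab)    // attaches every page of the tab and adds the tab to the tabs registry
  // ui.detachTab(tab) would undo the above
}
```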

                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#detaching-tab","title":"Detaching Tab
                                                                                                                                                                                                                                                                                                                                                                                                                          detachTab(\n  tab: WebUITab): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                          detachTab detaches the pages of the given WebUITab (and removes it from the tabs).

                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#pages","title":"Pages

                                                                                                                                                                                                                                                                                                                                                                                                                          WebUI uses pageToHandlers registry for WebUIPages and their associated ServletContextHandlers.

                                                                                                                                                                                                                                                                                                                                                                                                                          Pages can be attached and detached.

                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#attaching-page","title":"Attaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                          attachPage(\n  page: WebUIPage): Unit\n

attachPage creates the ServletContextHandlers for the given WebUIPage (its HTML and JSON renderings), attaches them, and records them in the pageToHandlers registry.

                                                                                                                                                                                                                                                                                                                                                                                                                          attachPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                          • WebUI is requested to attach a tab
                                                                                                                                                                                                                                                                                                                                                                                                                          • others
                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#detaching-page","title":"Detaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                          detachPage(\n  page: WebUIPage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                          detachPage removes the given WebUIPage from the UI (the pageToHandlers registry) with all of the handlers.

                                                                                                                                                                                                                                                                                                                                                                                                                          detachPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                          • WebUI is requested to detach a tab
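
The attach/detach bookkeeping described above can be pictured with a minimal, self-contained sketch (plain Scala stand-ins, not Spark's actual code; all names are illustrative): attaching a page registers one handler for the page path and one for its /json view, and detaching removes the page together with all of its handlers.

```scala
import scala.collection.mutable

// Illustrative stand-in for WebUIPage: a prefix plus HTML and JSON renderers.
final case class PageSketch(prefix: String, render: () => String, renderJson: () => String)

object WebUISketch {
  // Stands in for WebUI's pageToHandlers registry: page -> attached handler paths.
  private val pageToHandlers = mutable.Map.empty[PageSketch, mutable.Buffer[String]]

  def attachPage(page: PageSketch): Unit = {
    val pagePath = "/" + page.prefix
    val handlers = pageToHandlers.getOrElseUpdate(page, mutable.Buffer.empty)
    handlers += pagePath            // would serve page.render
    handlers += s"$pagePath/json"   // would serve page.renderJson
  }

  def detachPage(page: PageSketch): Unit =
    pageToHandlers.remove(page)     // drops the page with all of its handlers

  def main(args: Array[String]): Unit = {
    val jobs = PageSketch("jobs", () => "<html/>", () => "{}")
    attachPage(jobs)
    println(pageToHandlers(jobs))          // ArrayBuffer(/jobs, /jobs/json)
    detachPage(jobs)
    println(pageToHandlers.contains(jobs)) // false
  }
}
```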
                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUI/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                          Since WebUI is an abstract class, logging is configured using the logger of the implementations.

                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUIPage/","title":"WebUIPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                          WebUIPage is an abstraction of pages (of a WebUI) that can be rendered to HTML and JSON.

                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/WebUIPage/#contract","title":"Contract","text":""},{"location":"webui/WebUIPage/#rendering-page-to-html","title":"Rendering Page (to HTML)
                                                                                                                                                                                                                                                                                                                                                                                                                          render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                          • WebUI is requested to attach a page (to handle the URL)
                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUIPage/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                          • AllExecutionsPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • AllJobsPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • AllStagesPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ApplicationPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • BatchPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • DriverPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • EnvironmentPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ExecutionPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ExecutorsPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ExecutorThreadDumpPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • HistoryPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • JobPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • LogPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • MasterPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • MesosClusterPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • PoolPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • RDDPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • StagePage
                                                                                                                                                                                                                                                                                                                                                                                                                          • StoragePage
                                                                                                                                                                                                                                                                                                                                                                                                                          • StreamingPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • StreamingQueryPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • StreamingQueryStatisticsPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ThriftServerPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • ThriftServerSessionPage
                                                                                                                                                                                                                                                                                                                                                                                                                          • WorkerPage
                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/WebUIPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                          WebUIPage takes the following to be created:

• URL Prefix

Abstract Class

WebUIPage is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUIPages.

                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/WebUIPage/#rendering-page-to-json","title":"Rendering Page to JSON
                                                                                                                                                                                                                                                                                                                                                                                                                            renderJson(\n  request: HttpServletRequest): JValue\n

renderJson returns JNothing by default.

renderJson is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                            • WebUI is requested to attach a page (and handle the /json URL)
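
A minimal sketch of this contract with plain Scala stand-ins (not Spark's WebUIPage, which renders Scala XML nodes and json4s JValues): a concrete page must implement render, while renderJson can fall back to the "render nothing" default.

```scala
// Stand-in for the WebUIPage contract: render is abstract, renderJson has a
// default "render nothing" implementation (the real WebUIPage returns JNothing).
abstract class PageContract(val prefix: String) {
  def render(request: Map[String, String]): String            // stands in for Seq[Node]
  def renderJson(request: Map[String, String]): String = "{}"  // stands in for JValue
}

// A hypothetical page: HTML only, JSON falls back to the default.
class HelloPage extends PageContract("hello") {
  def render(request: Map[String, String]): String = "<h1>Hello</h1>"
}

object PageContractDemo extends App {
  val page = new HelloPage
  println(page.render(Map.empty))     // <h1>Hello</h1>
  println(page.renderJson(Map.empty)) // {}
}
```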
                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/WebUITab/","title":"WebUITab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                            WebUITab is an abstraction of UI tabs with a name and pages.

                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/WebUITab/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkUITab
                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/WebUITab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                            WebUITab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                            • WebUI
• Prefix

Abstract Class

WebUITab is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUITabs.

                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/WebUITab/#name","title":"Name
                                                                                                                                                                                                                                                                                                                                                                                                                              name: String\n

WebUITab has a name, which is the capitalized prefix by default.

                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/WebUITab/#pages","title":"Pages
                                                                                                                                                                                                                                                                                                                                                                                                                              pages: ArrayBuffer[WebUIPage]\n

                                                                                                                                                                                                                                                                                                                                                                                                                              WebUITab has WebUIPages.

                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/WebUITab/#attaching-page","title":"Attaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                              attachPage(\n  page: WebUIPage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                              attachPage registers the WebUIPage (in the pages registry).

                                                                                                                                                                                                                                                                                                                                                                                                                              attachPage adds the prefix of this WebUITab before the prefix of the given WebUIPage:

                                                                                                                                                                                                                                                                                                                                                                                                                              [prefix]/[page.prefix]\n
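
A minimal sketch (illustrative names only, not Spark's WebUITab) of the two behaviours described in this section: the tab's default name is its capitalized prefix, and attaching a page registers it under [prefix]/[page.prefix].

```scala
import scala.collection.mutable

// Stand-in for WebUITab: a prefix, a derived name, and a registry of page paths.
class TabSketch(val prefix: String) {
  val name: String = prefix.capitalize               // e.g. "stages" -> "Stages"
  val pages = mutable.ArrayBuffer.empty[String]      // resolved page paths

  def attachPage(pagePrefix: String): Unit =
    pages += s"$prefix/$pagePrefix".stripSuffix("/") // [prefix]/[page.prefix]
}

object TabSketchDemo extends App {
  val stages = new TabSketch("stages")
  stages.attachPage("pool")
  stages.attachPage("")        // a tab's "index" page with an empty prefix
  println(stages.name)         // Stages
  println(stages.pages)        // ArrayBuffer(stages/pool, stages)
}
```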
                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/configuration-properties/","title":"web UI Configuration Properties","text":""},{"location":"webui/configuration-properties/#sparkuicustomexecutorlogurl","title":"spark.ui.custom.executor.log.url

Specifies a custom Spark executor log URL for supporting an external log service instead of using the cluster manager's application log URLs in the web UI. Spark supports some path variables via patterns, which can vary by cluster manager. Please check the documentation of your cluster manager to see which patterns are supported, if any. This configuration replaces the original log URLs in the event log, and is also effective when accessing the application on the History Server. The new log URLs must be permanent; otherwise you might have dead links for executor log URLs.

                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                              • DriverEndpoint is created (and initializes an ExecutorLogUrlHandler)
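
For example, a sketch of setting the property on a SparkConf (the log-service URL and the {{APP_ID}}/{{EXECUTOR_ID}} pattern variables are made up for illustration; check your cluster manager's documentation for the patterns it actually supports):

```scala
import org.apache.spark.SparkConf

// Hypothetical external log service; the pattern variables are examples only.
val conf = new SparkConf()
  .set("spark.ui.custom.executor.log.url",
    "https://logs.example.com/{{APP_ID}}/{{EXECUTOR_ID}}/stdout")
```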
                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/configuration-properties/#sparkuienabled","title":"spark.ui.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                              Controls whether the web UI is started for the Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                              Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/configuration-properties/#sparkuiport","title":"spark.ui.port

                                                                                                                                                                                                                                                                                                                                                                                                                              The port the web UI of a Spark application binds to

                                                                                                                                                                                                                                                                                                                                                                                                                              Default: 4040

If multiple SparkContexts attempt to run on the same host (as different Spark applications), they will bind to successive ports beginning with spark.ui.port (up to spark.port.maxRetries retries).

                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkUI utility is used to get the UI port
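
A sketch of pinning the UI port for an application (the property names are real Spark configuration properties; the values are arbitrary):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.enabled", "true")     // default: true
  .set("spark.ui.port", "4050")        // first port to try (default: 4040)
  .set("spark.port.maxRetries", "16")  // successive ports to try when 4050 is taken
```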
                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/configuration-properties/#sparkuiprometheusenabled","title":"spark.ui.prometheus.enabled

(internal) Controls whether to expose executor metrics at /metrics/executors/prometheus

                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkUI is requested to initialize
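
With the property enabled, the executor metrics are served by the application's web UI; a quick way to peek at them from Scala (assuming a locally running application with the web UI on the default port 4040):

```scala
import scala.io.Source

// Assumes spark.ui.prometheus.enabled=true and the web UI at localhost:4040.
val url = "http://localhost:4040/metrics/executors/prometheus"
val metrics = Source.fromURL(url).mkString
metrics.split("\n").take(5).foreach(println)
```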
                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/configuration-properties/#review-me","title":"Review Me

spark.ui.allowFramingFrom

Defines the URL to use in ALLOW-FROM in the X-Frame-Options header (as described in http://tools.ietf.org/html/rfc7034).

Used exclusively when JettyUtils is requested to create an HttpServlet.

spark.ui.consoleProgress.update.interval

Update interval, i.e. how often to show the progress.

Default: 200 (ms)

spark.ui.killEnabled

Enables jobs and stages to be killed from the web UI (true) or not (false).

Default: true

Used exclusively when SparkUI is requested to initialize (and registers the redirect handlers for the /jobs/job/kill and /stages/stage/kill URIs).

spark.ui.retainedDeadExecutors

Default: 100

spark.ui.timeline.executors.maximum

The maximum number of entries in the <> registry.

Default: 1000

spark.ui.timeline.tasks.maximum

Default: 1000
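
Like the properties above, these are regular Spark configuration entries; for instance (illustrative values, assuming the web UI is embedded in a page served from that origin):

```scala
import org.apache.spark.SparkConf

// Allow framing the web UI only from this (example) origin and disable kill links.
val conf = new SparkConf()
  .set("spark.ui.allowFramingFrom", "https://dashboards.example.com")
  .set("spark.ui.killEnabled", "false")
```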

                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"The Internals of Spark Core (Apache Spark 3.5.0)","text":"

Welcome to The Internals of Spark Core online book! 🤙

                                                                                                                                                                                                                                                                                                                                                                                                                              I'm Jacek Laskowski, a Freelance Data Engineer specializing in Apache Spark (incl. Spark SQL and Spark Structured Streaming), Delta Lake, Databricks, and Apache Kafka (incl. Kafka Streams) with brief forays into a wider data engineering space (e.g., Trino, Dask and dbt, mostly during Warsaw Data Engineering meetups).

                                                                                                                                                                                                                                                                                                                                                                                                                              I'm very excited to have you here and hope you will enjoy exploring the internals of Spark Core as much as I have.

                                                                                                                                                                                                                                                                                                                                                                                                                              Flannery O'Connor

                                                                                                                                                                                                                                                                                                                                                                                                                              I write to discover what I know.

                                                                                                                                                                                                                                                                                                                                                                                                                              \"The Internals Of\" series

                                                                                                                                                                                                                                                                                                                                                                                                                              I'm also writing other online books in the \"The Internals Of\" series. Please visit \"The Internals Of\" Online Books home page.

                                                                                                                                                                                                                                                                                                                                                                                                                              Expect text and code snippets from a variety of public sources. Attribution follows.

Now, let's take a deep dive into Spark Core 🔥

                                                                                                                                                                                                                                                                                                                                                                                                                              Last update: 2024-02-17

                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"BytesToBytesMap/","title":"BytesToBytesMap","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                              BytesToBytesMap is a memory consumer that supports spilling.

                                                                                                                                                                                                                                                                                                                                                                                                                              Spark SQL

BytesToBytesMap is used only in Spark SQL, in the following:

                                                                                                                                                                                                                                                                                                                                                                                                                              • UnsafeFixedWidthAggregationMap
                                                                                                                                                                                                                                                                                                                                                                                                                              • UnsafeHashedRelation
                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"BytesToBytesMap/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                              BytesToBytesMap takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                              • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                                                                              • Initial Capacity
                                                                                                                                                                                                                                                                                                                                                                                                                              • Load Factor (default: 0.5)
                                                                                                                                                                                                                                                                                                                                                                                                                              • Page Size (bytes)

                                                                                                                                                                                                                                                                                                                                                                                                                                BytesToBytesMap is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                • UnsafeFixedWidthAggregationMap (Spark SQL) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                • UnsafeHashedRelation (Spark SQL) is created
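The load factor controls how full the map may get before it has to grow. As a rough sketch (the growthThreshold formula below is an assumption based on the numValues section of this page and the default load factor of 0.5, not a verified excerpt of Spark's code):

object LoadFactorSketch extends App {
  // Hypothetical model: the map grows once the number of values crosses capacity * loadFactor.
  val loadFactor = 0.5
  def growthThreshold(capacity: Int): Int = (capacity * loadFactor).toInt

  println(growthThreshold(64))        // 32: grow once half of the 64 slots hold keys
  println(growthThreshold(1 << 29))   // 268435456 (half of the maximum capacity)
}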
                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"BytesToBytesMap/#destructive-mapiterator","title":"Destructive MapIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                MapIterator destructiveIterator\n

                                                                                                                                                                                                                                                                                                                                                                                                                                BytesToBytesMap defines a reference to a \"destructive\" MapIterator (if ever created for UnsafeFixedWidthAggregationMap (Spark SQL)).

                                                                                                                                                                                                                                                                                                                                                                                                                                The destructiveIterator reference is in two states:

                                                                                                                                                                                                                                                                                                                                                                                                                                • Undefined (null) initially when BytesToBytesMap is created
                                                                                                                                                                                                                                                                                                                                                                                                                                • The MapIterator if created
                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#creating-destructive-mapiterator","title":"Creating Destructive MapIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                MapIterator destructiveIterator()\n

destructiveIterator updates the peak memory used (updatePeakMemoryUsed) and then creates a MapIterator with the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                • numValues for the number of records
                                                                                                                                                                                                                                                                                                                                                                                                                                • A new Location
                                                                                                                                                                                                                                                                                                                                                                                                                                • Destructive flag enabled (true)

                                                                                                                                                                                                                                                                                                                                                                                                                                destructiveIterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                • UnsafeFixedWidthAggregationMap (Spark SQL) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#spilling","title":"Spilling
                                                                                                                                                                                                                                                                                                                                                                                                                                long spill(\n  long size,\n  MemoryConsumer trigger)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                spill is part of the MemoryConsumer abstraction.

spill requests the destructive MapIterator to spill (the given size in bytes) only when the given MemoryConsumer (the trigger) is not this BytesToBytesMap and the destructive MapIterator is in use.

spill returns 0 when the trigger is this BytesToBytesMap or there is no destructiveIterator in use. Otherwise, spill returns the number of bytes the destructiveIterator managed to release.
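The guard can be pictured with the following self-contained sketch (the classes below are simplified stand-ins, not Spark's MemoryConsumer, MapIterator or BytesToBytesMap):

// Simplified models of the types involved; only the spill guard mirrors the description above.
trait MemoryConsumer {
  def spill(size: Long, trigger: MemoryConsumer): Long
}

class MapIterator {
  // Pretend the iterator can release at most 64 KB per request.
  def spill(size: Long): Long = math.min(size, 64L * 1024)
}

class BytesToBytesMapModel extends MemoryConsumer {
  private var iterator: MapIterator = null               // the "destructive" iterator, if ever created

  def createDestructiveIterator(): MapIterator = {
    iterator = new MapIterator
    iterator
  }

  override def spill(size: Long, trigger: MemoryConsumer): Long =
    if (trigger != this && iterator != null) iterator.spill(size)  // delegate to the iterator
    else 0L                                                        // self-triggered or no iterator yet
}

object SpillGuardDemo extends App {
  val map = new BytesToBytesMapModel
  println(map.spill(1024, map))                        // 0: the trigger is the map itself
  map.createDestructiveIterator()
  println(map.spill(1024, new BytesToBytesMapModel))   // 1024: released by the destructive iterator
}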

                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#numvalues","title":"numValues

The numValues registry is the number of appended values. It is 0 after reset.

numValues is incremented when Location is requested to append.

numValues can never exceed the maximum capacity of this BytesToBytesMap or the growthThreshold.

                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#maximum-capacity","title":"Maximum Capacity

BytesToBytesMap supports up to 1 << 29 (536,870,912) keys.

BytesToBytesMap makes sure that the initialCapacity is not bigger than this maximum capacity when created.
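A minimal sketch of that guard (the helper below is hypothetical, not Spark's code):

object MaxCapacitySketch extends App {
  val MAX_CAPACITY = 1 << 29                 // 536870912 keys

  // Hypothetical validation mirroring the description above.
  def validateInitialCapacity(initialCapacity: Int): Unit = {
    require(initialCapacity > 0, "initial capacity must be positive")
    require(initialCapacity <= MAX_CAPACITY,
      s"initial capacity ($initialCapacity) cannot exceed $MAX_CAPACITY")
  }

  validateInitialCapacity(1 << 20)           // fine
  // validateInitialCapacity(1 << 30)        // would fail: bigger than the maximum capacity
}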

                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#allocating-memory","title":"Allocating Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                void allocate(\n  int capacity)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                allocate...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                allocate is used when:

• BytesToBytesMap is created, reset, or requested to growAndRehash
                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"BytesToBytesMap/#growing-memory-and-rehashing","title":"Growing Memory And Rehashing
                                                                                                                                                                                                                                                                                                                                                                                                                                void growAndRehash()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                growAndRehash...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                growAndRehash is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                • Location is requested to append (a new value for a key)
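Since the body of growAndRehash is not described here (yet), the following is a generic illustration of the grow-and-rehash technique for an open-addressing hash table; it is not Spark's implementation (BytesToBytesMap stores key and value offsets into memory pages rather than the keys themselves):

// A toy open-addressing map for longs; only the growth strategy mirrors the idea above.
class TinyOpenHashMap(initialCapacity: Int = 8) {
  private var keys = new Array[Long](initialCapacity)
  private var used = new Array[Boolean](initialCapacity)
  private var numValues = 0

  private def slot(key: Long): Int = (key.hashCode & 0x7fffffff) % keys.length

  def put(key: Long): Unit = {
    if (numValues + 1 > keys.length / 2) growAndRehash()         // load factor of 0.5
    var i = slot(key)
    while (used(i) && keys(i) != key) i = (i + 1) % keys.length  // linear probing
    if (!used(i)) { used(i) = true; keys(i) = key; numValues += 1 }
  }

  def size: Int = numValues

  private def growAndRehash(): Unit = {
    val (oldKeys, oldUsed) = (keys, used)
    keys = new Array[Long](oldKeys.length * 2)                   // double the capacity
    used = new Array[Boolean](oldUsed.length * 2)
    numValues = 0
    for (i <- oldKeys.indices if oldUsed(i)) put(oldKeys(i))     // re-insert at new positions
  }
}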
                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"ConsoleProgressBar/","title":"ConsoleProgressBar","text":"

ConsoleProgressBar shows the progress of active stages on standard error (stderr). It uses SparkStatusTracker to poll the status of stages periodically and print out active stages with more than one task. It keeps overwriting the same line so that, at any time, a single line shows at most 3 concurrent stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                [Stage 0:====>          (316 + 4) / 1000][Stage 1:>                (0 + 0) / 1000][Stage 2:>                (0 + 0) / 1000]]]\n

The progress line shows, per stage, the stage id and the number of completed, active, and total tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                TIP: ConsoleProgressBar may be useful when you ssh to workers and want to see the progress of active stages.

ConsoleProgressBar is created when SparkContext is created with spark.ui.showConsoleProgress enabled and the logging level of the org.apache.spark.SparkContext logger set to WARN or higher (i.e. fewer messages are printed out and so there is a \"space\" for ConsoleProgressBar)."},{"location":"ConsoleProgressBar/#source-scala","title":"[source, scala]","text":"

import org.apache.log4j._
Logger.getLogger(\"org.apache.spark.SparkContext\").setLevel(Level.WARN)

To print the progress nicely, ConsoleProgressBar uses the COLUMNS environment variable to know the width of the terminal. It assumes 80 columns when COLUMNS is not set.

The progress bar prints out the status of a stage after it has run for at least 500 milliseconds, refreshing every spark.ui.consoleProgress.update.interval milliseconds.

NOTE: The initial delay of 500 milliseconds before ConsoleProgressBar shows the progress is not configurable.

                                                                                                                                                                                                                                                                                                                                                                                                                                See the progress bar in Spark shell with the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"ConsoleProgressBar/#source","title":"[source]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                $ ./bin/spark-shell --conf spark.ui.showConsoleProgress=true # <1>

                                                                                                                                                                                                                                                                                                                                                                                                                                scala> sc.setLogLevel(\"OFF\") // <2>

scala> import org.apache.log4j._
scala> Logger.getLogger(\"org.apache.spark.SparkContext\").setLevel(Level.WARN) // <3>

scala> sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count // <4>
[Stage 2:> (0 + 4) / 4]
[Stage 2:==============> (1 + 3) / 4]
[Stage 2:=============================> (2 + 2) / 4]
[Stage 2:============================================> (3 + 1) / 4]

<1> Make sure spark.ui.showConsoleProgress is true. It is by default.
<2> Disable (OFF) the root logger (that includes Spark's logger).
<3> Make sure the org.apache.spark.SparkContext logger is at least WARN.
<4> Run a job with 4 tasks with a 500ms initial sleep and 200ms sleep chunks to see the progress bar.

TIP: https://youtu.be/uEmcGo8rwek[Watch the short video] that shows ConsoleProgressBar in action.

You may want to use the following example to see the progress bar in full glory, i.e. all 3 concurrent stages in the console (borrowed from https://github.com/apache/spark/pull/3029#issuecomment-63244719[a comment to [SPARK-4017] show progress bar in console #3029]):

                                                                                                                                                                                                                                                                                                                                                                                                                                > ./bin/spark-shell\nscala> val a = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)\nscala> val b = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)\nscala> a.union(b).count()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                === [[creating-instance]] Creating ConsoleProgressBar Instance

ConsoleProgressBar requires a SparkContext.

When created, ConsoleProgressBar reads the spark.ui.consoleProgress.update.interval configuration property to set up the update interval and the COLUMNS environment variable for the terminal width (or assumes 80 columns).

ConsoleProgressBar starts the internal refresh progress timer that regularly refreshes and shows the progress.
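The "keep overwriting one line" technique can be pictured with the following self-contained sketch (the names, intervals and the fake progress are illustrative, not ConsoleProgressBar's actual code):

import java.util.{Timer, TimerTask}

object ProgressLineSketch extends App {
  val updateIntervalMs = 200L          // stands in for spark.ui.consoleProgress.update.interval
  val timer = new Timer("refresh progress", true)

  timer.schedule(new TimerTask {
    private var done = 0
    override def run(): Unit = {
      done = math.min(done + 100, 1000)
      val bar = "=" * (done * 20 / 1000)
      // \r moves the cursor to the beginning of the line so the next update overwrites it
      System.err.print(s"\r[Stage 0:$bar> ($done + 4) / 1000]")
      if (done == 1000) { System.err.println(); cancel() }
    }
  }, 500L /* initial delay, mirroring the non-configurable 500 ms */, updateIntervalMs)

  Thread.sleep(3000)                   // keep the JVM alive long enough to see the updates
  timer.cancel()                       // what stop does: cancel the internal timer
}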

NOTE: ConsoleProgressBar is created when SparkContext is created, the spark.ui.showConsoleProgress configuration property is enabled, and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out and so there is a \"space\" for ConsoleProgressBar).

                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: Once created, ConsoleProgressBar is available internally as _progressBar.

                                                                                                                                                                                                                                                                                                                                                                                                                                === [[finishAll]] finishAll Method

                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                === [[stop]] stop Method

                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"ConsoleProgressBar/#source-scala_1","title":"[source, scala]","text":""},{"location":"ConsoleProgressBar/#stop-unit","title":"stop(): Unit","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                stop cancels (stops) the internal timer.

NOTE: stop is executed when SparkContext stops.

                                                                                                                                                                                                                                                                                                                                                                                                                                === [[refresh]] refresh Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"ConsoleProgressBar/#source-scala_2","title":"[source, scala]","text":""},{"location":"ConsoleProgressBar/#refresh-unit","title":"refresh(): Unit","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                refresh...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: refresh is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"DriverLogger/","title":"DriverLogger","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                DriverLogger runs on the driver (in client deploy mode) to copy driver logs to Hadoop DFS periodically.

                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"DriverLogger/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                DriverLogger takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                  DriverLogger is created using apply utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"DriverLogger/#creating-driverlogger","title":"Creating DriverLogger
                                                                                                                                                                                                                                                                                                                                                                                                                                  apply(\n  conf: SparkConf): Option[DriverLogger]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                  apply creates a DriverLogger when the following hold:

                                                                                                                                                                                                                                                                                                                                                                                                                                  1. spark.driver.log.persistToDfs.enabled configuration property is enabled
                                                                                                                                                                                                                                                                                                                                                                                                                                  2. The Spark application runs in client deploy mode (and spark.submit.deployMode is client)
                                                                                                                                                                                                                                                                                                                                                                                                                                  3. spark.driver.log.dfsDir is specified

apply prints out the following WARN message to the logs when spark.driver.log.dfsDir is not specified:

                                                                                                                                                                                                                                                                                                                                                                                                                                  Driver logs are not persisted because spark.driver.log.dfsDir is not configured\n
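A configuration that satisfies all three conditions could look like the following sketch (the property names come from the conditions above; the DFS directory is a made-up example):

import org.apache.spark.SparkConf

object DriverLoggerConfSketch extends App {
  val conf = new SparkConf()
    .set("spark.driver.log.persistToDfs.enabled", "true")        // condition 1
    .set("spark.submit.deployMode", "client")                     // condition 2: client deploy mode
    .set("spark.driver.log.dfsDir", "hdfs:///spark-driver-logs")  // condition 3: a made-up path
  println(conf.toDebugString)
}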

apply is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"DriverLogger/#starting-dfsasyncwriter","title":"Starting DfsAsyncWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                  startSync(\n  hadoopConf: Configuration): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                  startSync creates and starts a DfsAsyncWriter (with the spark.app.id configuration property).

startSync is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to postApplicationStart
                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"ExecutorDeadException/","title":"ExecutorDeadException","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorDeadException is a SparkException.

                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"ExecutorDeadException/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorDeadException takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                  • Error message

ExecutorDeadException is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • NettyBlockTransferService is requested to fetch blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/","title":"FileCommitProtocol","text":"

FileCommitProtocol is an abstraction of file committers that can set up, commit or abort a Spark job or task (while writing out a pair RDD and its partitions).

FileCommitProtocol is used for the RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset actions (that use the SparkHadoopWriter utility to write a key-value RDD out).

                                                                                                                                                                                                                                                                                                                                                                                                                                    FileCommitProtocol is created using FileCommitProtocol.instantiate utility.
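A sketch of calling the instantiate utility directly, assuming the three-argument form instantiate(className, jobId, outputPath) and Spark's Hadoop MR-based committer as the class to load; the job id and output path are illustrative, and FileCommitProtocol is normally instantiated by Spark itself:

import org.apache.spark.internal.io.{FileCommitProtocol, HadoopMapReduceCommitProtocol}

object FileCommitProtocolSketch extends App {
  // instantiate loads the given committer class reflectively.
  val committer: FileCommitProtocol = FileCommitProtocol.instantiate(
    classOf[HadoopMapReduceCommitProtocol].getName,
    "0",                 // job id (illustrative)
    "/tmp/commit-demo")  // output path (illustrative)
  println(committer.getClass.getName)
}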

                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#contract","title":"Contract","text":""},{"location":"FileCommitProtocol/#abortJob","title":"Aborting Job","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    abortJob(\n  jobContext: JobContext): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    Aborts a job

                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter utility is used to write a key-value RDD (and writing fails)
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query (and writing fails)
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileBatchWrite is requested to abort
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#abortTask","title":"Aborting Task","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    abortTask(\n  taskContext: TaskAttemptContext): Unit\n

Aborts a task

                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter utility is used to write an RDD partition
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatDataWriter is requested to abort
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#commitJob","title":"Committing Job","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    commitJob(\n  jobContext: JobContext,\n  taskCommits: Seq[TaskCommitMessage]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    Commits a job after the writes succeed

                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter utility is used to write a key-value RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileBatchWrite is requested to commit
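A minimal, hypothetical driver-side sketch of how setupJob, commitJob and abortJob fit together (writeJob and runTasks are made-up names; runTasks stands in for launching the write tasks and collecting their TaskCommitMessages, roughly what SparkHadoopWriter does with runJob):

```scala
import org.apache.hadoop.mapreduce.JobContext
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

// Hypothetical driver-side flow around a FileCommitProtocol committer
def writeJob(
    committer: FileCommitProtocol,
    jobContext: JobContext)(
    runTasks: () => Seq[TaskCommitMessage]): Unit = {
  committer.setupJob(jobContext)                   // driver: prepare the output location
  try {
    val taskCommits = runTasks()                   // executors write partitions and commit tasks
    committer.commitJob(jobContext, taskCommits)   // driver: make all task output visible
  } catch {
    case t: Throwable =>
      committer.abortJob(jobContext)               // driver: best-effort cleanup of the failed write
      throw t
  }
}
```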
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#commitTask","title":"Committing Task","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    commitTask(\n  taskContext: TaskAttemptContext): TaskCommitMessage\n

Commits a task after the writes succeed

Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter utility is used to write an RDD partition
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatDataWriter is requested to commit
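The matching task-side flow, again as a hypothetical sketch (writeTask and writeRecords are made-up names; the deprecated ext-based newTaskTempFile overload is used for brevity):

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

// Hypothetical executor-side flow for one task attempt; writeRecords stands in for
// whatever writes this partition's data to the path returned by newTaskTempFile
def writeTask(
    committer: FileCommitProtocol,
    taskContext: TaskAttemptContext)(
    writeRecords: String => Unit): TaskCommitMessage = {
  committer.setupTask(taskContext)
  try {
    // ask the committer where this task attempt should write its data
    val tempFile = committer.newTaskTempFile(taskContext, None, ".txt")
    writeRecords(tempFile)
    committer.commitTask(taskContext)     // the TaskCommitMessage goes back to the driver
  } catch {
    case t: Throwable =>
      committer.abortTask(taskContext)
      throw t
  }
}
```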
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#deleteWithJob","title":"Deleting Path with Job","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    deleteWithJob(\n  fs: FileSystem,\n  path: Path,\n  recursive: Boolean): Boolean\n

deleteWithJob requests the given Hadoop FileSystem to delete the given path (recursively, when the recursive flag is enabled).

Used when:

• (Spark SQL) InsertIntoHadoopFsRelationCommand logical command is executed

                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#newTaskTempFile","title":"New Task Temp File","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    newTaskTempFile(\n  taskContext: TaskAttemptContext,\n  dir: Option[String],\n  spec: FileNameSpec): String\nnewTaskTempFile(\n  taskContext: TaskAttemptContext,\n  dir: Option[String],\n  ext: String): String // @deprecated\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    Builds a path of a temporary file (for a task to write data to)

                                                                                                                                                                                                                                                                                                                                                                                                                                    See:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • HadoopMapReduceCommitProtocol
                                                                                                                                                                                                                                                                                                                                                                                                                                    • DelayedCommitProtocol (Delta Lake)

                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) SingleDirectoryDataWriter is requested to write a record out
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) BaseDynamicPartitionDataWriter is requested to renewCurrentWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#newTaskTempFileAbsPath","title":"newTaskTempFileAbsPath","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    newTaskTempFileAbsPath(\n  taskContext: TaskAttemptContext,\n  absoluteDir: String,\n  ext: String): String\n

Builds the path of a temporary file (for a task to write data to) under the given absolute directory

Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) DynamicPartitionDataWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#onTaskCommit","title":"On Task Committed","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    onTaskCommit(\n  taskCommit: TaskCommitMessage): Unit\n

Called on the driver after a task commits

Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#setupJob","title":"Setting Up Job","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    setupJob(\n  jobContext: JobContext): Unit\n

Sets up a job (with the Hadoop JobContext)

Used when:

• SparkHadoopWriter utility is used to write a key-value RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileFormatWriter utility is used to write a result of a structured query
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileWriteBuilder is requested to buildForBatch
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#setupTask","title":"Setting Up Task","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    setupTask(\n  taskContext: TaskAttemptContext): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    Sets up the task with the Hadoop TaskAttemptContext

                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter is requested to write an RDD partition (while writing out a key-value RDD)
• (Spark SQL) FileFormatWriter utility is used to write out an RDD partition (while writing out a result of a structured query)
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) FileWriterFactory is requested to createWriter
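To make the contract above concrete, the following is a hypothetical, do-nothing committer that implements only the abstract methods listed in this contract (the exact set of abstract methods can differ across Spark versions). It is a sketch, not a usable committer:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

// Illustration only: writes task files straight under `path` and never moves or
// cleans anything up. Real committers track the files they hand out in
// newTaskTempFile and publish them atomically in commitJob.
class NoOpCommitProtocol(jobId: String, path: String) extends FileCommitProtocol {

  override def setupJob(jobContext: JobContext): Unit = ()
  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = ()
  override def abortJob(jobContext: JobContext): Unit = ()

  override def setupTask(taskContext: TaskAttemptContext): Unit = ()
  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage =
    new TaskCommitMessage(Map.empty[String, String])
  override def abortTask(taskContext: TaskAttemptContext): Unit = ()

  override def newTaskTempFile(
      taskContext: TaskAttemptContext, dir: Option[String], ext: String): String = {
    val taskId = taskContext.getTaskAttemptID.getTaskID.getId
    val parent = dir.map(new Path(path, _)).getOrElse(new Path(path))
    new Path(parent, s"part-$taskId$ext").toString
  }

  override def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String = {
    val taskId = taskContext.getTaskAttemptID.getTaskID.getId
    new Path(absoluteDir, s"part-$taskId$ext").toString
  }
}
```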
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                    • HadoopMapReduceCommitProtocol
• ManifestFileCommitProtocol (Spark Structured Streaming)
                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"FileCommitProtocol/#instantiating-filecommitprotocol-committer","title":"Instantiating FileCommitProtocol Committer
                                                                                                                                                                                                                                                                                                                                                                                                                                    instantiate(\n  className: String,\n  jobId: String,\n  outputPath: String,\n  dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    instantiate prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                    Creating committer [className]; job [jobId]; output=[outputPath]; dynamic=[dynamicPartitionOverwrite]\n

instantiate tries to find a constructor that takes three arguments (two of type String and one Boolean) for the given jobId, outputPath and dynamicPartitionOverwrite flag. If found, instantiate prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                    Using (String, String, Boolean) constructor\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    In case of NoSuchMethodException, instantiate prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                    Falling back to (String, String) constructor\n

instantiate then tries to find a constructor that takes two String arguments for the given jobId and outputPath.

With the two-String constructor, instantiate requires that the given dynamicPartitionOverwrite flag be disabled (false) or throws an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                    requirement failed: Dynamic Partition Overwrite is enabled but the committer [className] does not have the appropriate constructor\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    instantiate is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • HadoopMapRedWriteConfigUtil and HadoopMapReduceWriteConfigUtil are requested to create a HadoopMapReduceCommitProtocol committer
                                                                                                                                                                                                                                                                                                                                                                                                                                    • (Spark SQL) InsertIntoHadoopFsRelationCommand, InsertIntoHiveDirCommand, and InsertIntoHiveTable logical commands are executed
• (Spark Structured Streaming) FileStreamSink is requested to write out micro-batch data
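instantiate can also be called directly. A hypothetical invocation (the job ID and output path are made up) that yields the default HadoopMapReduceCommitProtocol via its (String, String, Boolean) constructor:

```scala
import org.apache.spark.internal.io.{FileCommitProtocol, HadoopMapReduceCommitProtocol}

// illustrative values only
val committer: FileCommitProtocol = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = "job-0001",
  outputPath = "/tmp/illustrative-output",
  dynamicPartitionOverwrite = false)
```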
                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"FileCommitProtocol/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.internal.io.FileCommitProtocol logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.FileCommitProtocol.name = org.apache.spark.internal.io.FileCommitProtocol\nlogger.FileCommitProtocol.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"HadoopMapRedCommitProtocol/","title":"HadoopMapRedCommitProtocol","text":"

HadoopMapRedCommitProtocol is a HadoopMapReduceCommitProtocol for the older Hadoop MapReduce API (org.apache.hadoop.mapred).

                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"HadoopMapRedWriteConfigUtil/","title":"HadoopMapRedWriteConfigUtil","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                    HadoopMapRedWriteConfigUtil is a HadoopWriteConfigUtil for RDD.saveAsHadoopDataset operator.

                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"HadoopMapRedWriteConfigUtil/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                    HadoopMapRedWriteConfigUtil takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                    • SerializableJobConf

                                                                                                                                                                                                                                                                                                                                                                                                                                      HadoopMapRedWriteConfigUtil is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                      • PairRDDFunctions is requested to saveAsHadoopDataset
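For example, assuming sc is the active SparkContext (e.g. in spark-shell) and the output path is made up, a write like the following goes through HadoopMapRedWriteConfigUtil:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputKeyClass(classOf[Text])
jobConf.setOutputValueClass(classOf[IntWritable])
jobConf.setOutputFormat(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(jobConf, new Path("/tmp/old-api-out"))

// old org.apache.hadoop.mapred API => HadoopMapRedWriteConfigUtil under the covers
pairs.saveAsHadoopDataset(jobConf)
```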
                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"HadoopMapRedWriteConfigUtil/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                      logger.HadoopMapRedWriteConfigUtil.name = org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil\nlogger.HadoopMapRedWriteConfigUtil.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"HadoopMapReduceCommitProtocol/","title":"HadoopMapReduceCommitProtocol","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                      HadoopMapReduceCommitProtocol is a FileCommitProtocol.

HadoopMapReduceCommitProtocol is Serializable (Java) so it can be sent out in tasks over the wire to executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"HadoopMapReduceCommitProtocol/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                      HadoopMapReduceCommitProtocol takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                      • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                      • Path
                                                                                                                                                                                                                                                                                                                                                                                                                                      • dynamicPartitionOverwrite flag (default: false)

                                                                                                                                                                                                                                                                                                                                                                                                                                        HadoopMapReduceCommitProtocol is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                        • HadoopWriteConfigUtil is requested to create a committer
                                                                                                                                                                                                                                                                                                                                                                                                                                        • HadoopMapReduceWriteConfigUtil is requested to create a committer
                                                                                                                                                                                                                                                                                                                                                                                                                                        • HadoopMapRedWriteConfigUtil is requested to create a committer
                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"HadoopMapReduceCommitProtocol/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.internal.io.HadoopMapReduceCommitProtocol logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                        logger.HadoopMapReduceCommitProtocol.name = org.apache.spark.internal.io.HadoopMapReduceCommitProtocol\nlogger.HadoopMapReduceCommitProtocol.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"HadoopMapReduceWriteConfigUtil/","title":"HadoopMapReduceWriteConfigUtil","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                        HadoopMapReduceWriteConfigUtil is a HadoopWriteConfigUtil for RDD.saveAsNewAPIHadoopDataset operator.

                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"HadoopMapReduceWriteConfigUtil/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                        HadoopMapReduceWriteConfigUtil takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                        • SerializableConfiguration

                                                                                                                                                                                                                                                                                                                                                                                                                                          HadoopMapReduceWriteConfigUtil is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                          • PairRDDFunctions is requested to saveAsNewAPIHadoopDataset
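For example, the new-API counterpart of the earlier snippet (again assuming sc is the active SparkContext and the output path is made up) goes through HadoopMapReduceWriteConfigUtil:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[IntWritable])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(job, new Path("/tmp/new-api-out"))

// new org.apache.hadoop.mapreduce API => HadoopMapReduceWriteConfigUtil under the covers
pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
```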
                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"HadoopMapReduceWriteConfigUtil/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                          logger.HadoopMapReduceWriteConfigUtil.name = org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil\nlogger.HadoopMapReduceWriteConfigUtil.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                          Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"HadoopWriteConfigUtil/","title":"HadoopWriteConfigUtil","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                          HadoopWriteConfigUtil[K, V] is an abstraction of writer configurers for SparkHadoopWriter to write a key-value RDD (for RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset operators).

                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"HadoopWriteConfigUtil/#contract","title":"Contract","text":""},{"location":"HadoopWriteConfigUtil/#assertconf","title":"assertConf
                                                                                                                                                                                                                                                                                                                                                                                                                                          assertConf(\n  jobContext: JobContext,\n  conf: SparkConf): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#closewriter","title":"closeWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                          closeWriter(\n  taskContext: TaskAttemptContext): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#createcommitter","title":"createCommitter
                                                                                                                                                                                                                                                                                                                                                                                                                                          createCommitter(\n  jobId: Int): HadoopMapReduceCommitProtocol\n

                                                                                                                                                                                                                                                                                                                                                                                                                                          Creates a HadoopMapReduceCommitProtocol committer

                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkHadoopWriter is requested to write data out
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#createjobcontext","title":"createJobContext
                                                                                                                                                                                                                                                                                                                                                                                                                                          createJobContext(\n  jobTrackerId: String,\n  jobId: Int): JobContext\n
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#createtaskattemptcontext","title":"createTaskAttemptContext
                                                                                                                                                                                                                                                                                                                                                                                                                                          createTaskAttemptContext(\n  jobTrackerId: String,\n  jobId: Int,\n  splitId: Int,\n  taskAttemptId: Int): TaskAttemptContext\n

                                                                                                                                                                                                                                                                                                                                                                                                                                          Creates a Hadoop TaskAttemptContext

                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#initoutputformat","title":"initOutputFormat
                                                                                                                                                                                                                                                                                                                                                                                                                                          initOutputFormat(\n  jobContext: JobContext): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#initwriter","title":"initWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                          initWriter(\n  taskContext: TaskAttemptContext,\n  splitId: Int): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#write","title":"write
                                                                                                                                                                                                                                                                                                                                                                                                                                          write(\n  pair: (K, V)): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                          Writes out the key-value pair

                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkHadoopWriter is requested to executeTask
                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"HadoopWriteConfigUtil/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                          • HadoopMapReduceWriteConfigUtil
                                                                                                                                                                                                                                                                                                                                                                                                                                          • HadoopMapRedWriteConfigUtil
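As a minimal sketch of how the contract is exercised, consider the hypothetical helper below (not SparkHadoopWriter's actual code; HadoopWriteConfigUtil is package-private in Spark, so treat this purely as an illustration of the call order).

import org.apache.spark.SparkConf
import org.apache.spark.internal.io.HadoopWriteConfigUtil

// Hypothetical helper showing the order in which the contract methods are used.
def writePartitionSketch[K, V](
    config: HadoopWriteConfigUtil[K, V],
    sparkConf: SparkConf,
    jobTrackerId: String,
    jobId: Int,
    splitId: Int,
    taskAttemptId: Int,
    records: Iterator[(K, V)]): Unit = {
  // Driver side: prepare and validate the job (SparkHadoopWriter.write)
  val jobContext = config.createJobContext(jobTrackerId, jobId)
  config.initOutputFormat(jobContext)
  config.assertConf(jobContext, sparkConf)
  val committer = config.createCommitter(jobId)
  committer.setupJob(jobContext)

  // Executor side: write a single partition (SparkHadoopWriter.executeTask)
  val taskContext = config.createTaskAttemptContext(jobTrackerId, jobId, splitId, taskAttemptId)
  committer.setupTask(taskContext)
  config.initWriter(taskContext, splitId)
  try {
    records.foreach(config.write)
  } finally {
    config.closeWriter(taskContext)
  }
  committer.commitTask(taskContext)
}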
                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"HeartbeatReceiver/","title":"HeartbeatReceiver RPC Endpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                          HeartbeatReceiver is a ThreadSafeRpcEndpoint that is registered on the driver as HeartbeatReceiver.

HeartbeatReceiver receives Heartbeat messages from executors with accumulator updates (task metrics and a Spark application's accumulators) and passes them on to the TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                          HeartbeatReceiver is registered immediately after a Spark application is started (i.e. when SparkContext is created).

                                                                                                                                                                                                                                                                                                                                                                                                                                          HeartbeatReceiver is a SparkListener to get notified about new executors or executors that are no longer available.
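For context, a simplified sketch of the registration (adapted from what SparkContext does internally while being created; env is the SparkEnv and this is the SparkContext):

// Register HeartbeatReceiver with the driver's RpcEnv under the name "HeartbeatReceiver"
val heartbeatReceiver = env.rpcEnv.setupEndpoint(
  "HeartbeatReceiver", new HeartbeatReceiver(this))

// Later, once the TaskScheduler is up, tell HeartbeatReceiver about it
// (see the TaskSchedulerIsSet message below)
heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)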

                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"HeartbeatReceiver/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                          HeartbeatReceiver takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                          • Clock (default: SystemClock)

HeartbeatReceiver is created when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"HeartbeatReceiver/#taskscheduler","title":"TaskScheduler

                                                                                                                                                                                                                                                                                                                                                                                                                                            HeartbeatReceiver manages a reference to TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#rpc-messages","title":"RPC Messages","text":""},{"location":"HeartbeatReceiver/#executorremoved","title":"ExecutorRemoved

                                                                                                                                                                                                                                                                                                                                                                                                                                            Attributes:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor ID

                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when HeartbeatReceiver is notified that an executor is no longer available

                                                                                                                                                                                                                                                                                                                                                                                                                                            When received, HeartbeatReceiver removes the executor (from executorLastSeen internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#executorregistered","title":"ExecutorRegistered

                                                                                                                                                                                                                                                                                                                                                                                                                                            Attributes:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor ID

                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when HeartbeatReceiver is notified that a new executor has been registered

                                                                                                                                                                                                                                                                                                                                                                                                                                            When received, HeartbeatReceiver registers the executor and the current time (in executorLastSeen internal registry).
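The bookkeeping behind these two messages can be sketched as follows (a hypothetical standalone object; in Spark the logic lives in HeartbeatReceiver.receiveAndReply and uses the Clock given at construction time rather than System.currentTimeMillis):

import scala.collection.mutable

object ExecutorRegistrySketch {
  private val executorLastSeen = mutable.Map.empty[String, Long]

  // ExecutorRegistered: remember the executor together with the current time
  def onExecutorRegistered(executorId: String): Unit =
    executorLastSeen(executorId) = System.currentTimeMillis()

  // ExecutorRemoved: stop tracking the executor
  def onExecutorRemoved(executorId: String): Unit =
    executorLastSeen.remove(executorId)
}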

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#expiredeadhosts","title":"ExpireDeadHosts

                                                                                                                                                                                                                                                                                                                                                                                                                                            No attributes

                                                                                                                                                                                                                                                                                                                                                                                                                                            When received, HeartbeatReceiver prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                            Checking for hosts with no recent heartbeats in HeartbeatReceiver.\n

Each executor (in the executorLastSeen internal registry) is checked to see whether its last-seen time is within spark.network.timeout.

                                                                                                                                                                                                                                                                                                                                                                                                                                            For any such executor, HeartbeatReceiver prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                            Removing executor [executorId] with no recent heartbeats: [time] ms exceeds timeout [timeout] ms\n

HeartbeatReceiver requests the TaskScheduler to handle the lost executor (TaskScheduler.executorLost with SlaveLost(\"Executor heartbeat timed out after [timeout] ms\")).

                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkContext.killAndReplaceExecutor is asynchronously called for the executor (i.e. on killExecutorThread).

                                                                                                                                                                                                                                                                                                                                                                                                                                            The executor is removed from the executorLastSeen internal registry.
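The expiration just described can be sketched as a standalone method (hypothetical names; in Spark this is expireDeadHosts, and the expireExecutor callback stands for the TaskScheduler.executorLost plus SparkContext.killAndReplaceExecutor calls):

import scala.collection.mutable

// Hedged sketch of the ExpireDeadHosts handling described above.
def expireDeadHostsSketch(
    executorLastSeen: mutable.Map[String, Long],
    executorTimeoutMs: Long,                  // spark.network.timeout
    now: Long,
    expireExecutor: String => Unit): Unit = {
  for ((executorId, lastSeenMs) <- executorLastSeen.toSeq) {
    if (now - lastSeenMs > executorTimeoutMs) {
      // WARN Removing executor [executorId] with no recent heartbeats: ...
      expireExecutor(executorId)              // executorLost + killAndReplaceExecutor
      executorLastSeen.remove(executorId)
    }
  }
}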

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#heartbeat","title":"Heartbeat

                                                                                                                                                                                                                                                                                                                                                                                                                                            Attributes:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                            • AccumulatorV2 updates (by task ID)
                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManagerId
                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorMetrics peaks (by stage and stage attempt IDs)

Posted when an Executor informs the driver that it is alive and reports task metrics.

When received, HeartbeatReceiver looks up the executor (by executorId) in the executorLastSeen internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                            When the executor is found, HeartbeatReceiver updates the time the heartbeat was received (in executorLastSeen internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                            HeartbeatReceiver uses the Clock to know the current time.

                                                                                                                                                                                                                                                                                                                                                                                                                                            HeartbeatReceiver then submits an asynchronous task to notify TaskScheduler that the heartbeat was received from the executor (using TaskScheduler internal reference). HeartbeatReceiver posts a HeartbeatResponse back to the executor (with the response from TaskScheduler whether the executor has been registered already or not so it may eventually need to re-register).

If, however, the executor is not found in the executorLastSeen internal registry (i.e. it was never registered), you should see the following DEBUG message in the logs and the response notifies the executor to re-register.

                                                                                                                                                                                                                                                                                                                                                                                                                                            Received heartbeat from unknown executor [executorId]\n

In the very rare case that the TaskScheduler has not yet been assigned to HeartbeatReceiver, you should see the following WARN message in the logs and, again, the response notifies the executor to re-register.

                                                                                                                                                                                                                                                                                                                                                                                                                                            Dropping [heartbeat] because TaskScheduler is not ready yet\n
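Putting the three cases together, the Heartbeat handling can be sketched like this (hypothetical standalone method; the returned flag stands for the re-register decision carried by the HeartbeatResponse, which for a known executor actually comes from the TaskScheduler):

import scala.collection.mutable

// Hedged sketch of the Heartbeat handling; returns true when the executor
// should re-register.
def handleHeartbeatSketch(
    executorId: String,
    executorLastSeen: mutable.Map[String, Long],
    now: Long,
    schedulerReady: Boolean,
    notifyTaskScheduler: String => Unit): Boolean = {
  if (!schedulerReady) {
    // WARN Dropping [heartbeat] because TaskScheduler is not ready yet
    true
  } else if (executorLastSeen.contains(executorId)) {
    executorLastSeen(executorId) = now     // record when the heartbeat arrived
    notifyTaskScheduler(executorId)        // asynchronous notification in Spark
    false                                  // simplification: no re-registration needed
  } else {
    // DEBUG Received heartbeat from unknown executor [executorId]
    true
  }
}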
                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#taskschedulerisset","title":"TaskSchedulerIsSet

                                                                                                                                                                                                                                                                                                                                                                                                                                            No attributes

                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when SparkContext informs that TaskScheduler is available.

                                                                                                                                                                                                                                                                                                                                                                                                                                            When received, HeartbeatReceiver sets the internal reference to TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#onexecutoradded","title":"onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorAdded(\n  executorAdded: SparkListenerExecutorAdded): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorAdded sends an ExecutorRegistered message to itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorAdded\u00a0is part of the SparkListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#addexecutor","title":"addExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                            addExecutor(\n  executorId: String): Option[Future[Boolean]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            addExecutor...FIXME
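A hedged guess at what addExecutor does, based on onExecutorAdded sending ExecutorRegistered to the endpoint itself (an assumption, not documented behaviour; self is the endpoint's own RpcEndpointRef):

// Assumed sketch: ask the endpoint itself to handle ExecutorRegistered
// and return the future with the reply.
def addExecutor(executorId: String): Option[Future[Boolean]] =
  Option(self).map(_.ask[Boolean](ExecutorRegistered(executorId)))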

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#onexecutorremoved","title":"onExecutorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorRemoved(\n  executorRemoved: SparkListenerExecutorRemoved): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorRemoved removes the executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                            onExecutorRemoved\u00a0is part of the SparkListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#removeexecutor","title":"removeExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                            removeExecutor(\n  executorId: String): Option[Future[Boolean]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            removeExecutor...FIXME
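Analogously to addExecutor, a hedged guess at removeExecutor (again an assumption):

// Assumed sketch: ask the endpoint itself to handle ExecutorRemoved.
def removeExecutor(executorId: String): Option[Future[Boolean]] =
  Option(self).map(_.ask[Boolean](ExecutorRemoved(executorId)))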

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#starting-heartbeatreceiver","title":"Starting HeartbeatReceiver
                                                                                                                                                                                                                                                                                                                                                                                                                                            onStart(): Unit\n

onStart schedules a recurring task on eventLoopThread that sends a (blocking) ExpireDeadHosts message every spark.network.timeoutInterval, as sketched below.

                                                                                                                                                                                                                                                                                                                                                                                                                                            onStart\u00a0is part of the RpcEndpoint abstraction.
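A hedged sketch of the scheduling inside onStart (checkTimeoutIntervalMs stands for spark.network.timeoutInterval; the exact Spark code differs in error handling):

import java.util.concurrent.TimeUnit

override def onStart(): Unit = {
  // Every spark.network.timeoutInterval, ask the endpoint itself to expire dead hosts
  eventLoopThread.scheduleAtFixedRate(
    () => Option(self).foreach(_.ask[Boolean](ExpireDeadHosts)),
    0, checkTimeoutIntervalMs, TimeUnit.MILLISECONDS)
}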

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#stopping-heartbeatreceiver","title":"Stopping HeartbeatReceiver
                                                                                                                                                                                                                                                                                                                                                                                                                                            onStop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            onStop shuts down the eventLoopThread and killExecutorThread thread pools.

                                                                                                                                                                                                                                                                                                                                                                                                                                            onStop\u00a0is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#handling-two-way-messages","title":"Handling Two-Way Messages
                                                                                                                                                                                                                                                                                                                                                                                                                                            receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            receiveAndReply...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                            receiveAndReply\u00a0is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#thread-pools","title":"Thread Pools","text":""},{"location":"HeartbeatReceiver/#kill-executor-thread","title":"kill-executor-thread

                                                                                                                                                                                                                                                                                                                                                                                                                                            killExecutorThread is a daemon ScheduledThreadPoolExecutor with a single thread.

                                                                                                                                                                                                                                                                                                                                                                                                                                            The name of the thread pool is kill-executor-thread.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#heartbeat-receiver-event-loop-thread","title":"heartbeat-receiver-event-loop-thread

                                                                                                                                                                                                                                                                                                                                                                                                                                            eventLoopThread is a daemon ScheduledThreadPoolExecutor with a single thread.

                                                                                                                                                                                                                                                                                                                                                                                                                                            The name of the thread pool is heartbeat-receiver-event-loop-thread.
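Both pools can be sketched with a plain-JDK helper like the one below (Spark builds them with its internal ThreadUtils utilities; this is just an equivalent illustration):

import java.util.concurrent.{ScheduledThreadPoolExecutor, ThreadFactory}

// Single-threaded, daemon ScheduledThreadPoolExecutor with a fixed thread name,
// e.g. "kill-executor-thread" or "heartbeat-receiver-event-loop-thread".
def newDaemonSingleThreadScheduledExecutor(threadName: String): ScheduledThreadPoolExecutor = {
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, threadName)
      t.setDaemon(true)
      t
    }
  }
  new ScheduledThreadPoolExecutor(1, factory)
}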

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#expiring-dead-hosts","title":"Expiring Dead Hosts
                                                                                                                                                                                                                                                                                                                                                                                                                                            expireDeadHosts(): Unit\n

expireDeadHosts checks the most recent heartbeat of every known executor and expires the executors that have not sent a heartbeat within the executor timeout (see the sketch below).

expireDeadHosts is used when HeartbeatReceiver is requested to handle an ExpireDeadHosts message.
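
A hedged sketch of the expiration logic (simplified; the executorLastSeen registry, the scheduler, the clock and executorTimeoutMs mirror the actual HeartbeatReceiver internals but should be treated as assumptions here):

val now = clock.getTimeMillis()\nfor ((executorId, lastSeenMs) <- executorLastSeen) {\n  if (now - lastSeenMs > executorTimeoutMs) {\n    // The executor missed its heartbeats: tell the TaskScheduler it is lost, ask\n    // SparkContext (on kill-executor-thread) to kill and replace it, and forget it\n    scheduler.executorLost(executorId, ExecutorProcessLost(\"Executor heartbeat timed out\"))\n    killExecutorThread.submit(new Runnable {\n      override def run(): Unit = sc.killAndReplaceExecutor(executorId)\n    })\n    executorLastSeen.remove(executorId)\n  }\n}\n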

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"HeartbeatReceiver/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.HeartbeatReceiver logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.HeartbeatReceiver=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"InterruptibleIterator/","title":"InterruptibleIterator","text":"

InterruptibleIterator is a custom Scala Iterator (https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator) that supports task cancellation: hasNext kills the task when it has been interrupted (cancelled).

Quoting the official Scala Iterator (https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator) documentation:

                                                                                                                                                                                                                                                                                                                                                                                                                                            Iterators are data structures that allow to iterate over a sequence of elements. They have a hasNext method for checking if there is a next element available, and a next method which returns the next element and discards it from the iterator.

InterruptibleIterator is created when:

• RDD is requested to get or compute an RDD partition

• CoGroupedRDD, HadoopRDD, NewHadoopRDD, ParallelCollectionRDD are requested to compute a partition

• BlockStoreShuffleReader is requested to read combined key-value records for a reduce task

• PairRDDFunctions is requested to combineByKeyWithClassTag

                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spark SQL's DataSourceRDD and JDBCRDD are requested to compute a partition

                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spark SQL's RangeExec physical operator is requested to doExecute

                                                                                                                                                                                                                                                                                                                                                                                                                                            • PySpark's BasePythonRunner is requested to compute

InterruptibleIterator takes the following to be created:

• TaskContext
• Scala Iterator[T]

InterruptibleIterator is a Developer API: a lower-level, unstable API intended for Spark developers that may change or be removed in minor versions of Apache Spark.

hasNext Method

                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"InterruptibleIterator/#source-scala","title":"[source, scala]","text":""},{"location":"InterruptibleIterator/#hasnext-boolean","title":"hasNext: Boolean","text":"

hasNext is part of the Iterator contract (https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@hasNext:Boolean) to test whether this iterator can provide another element.

hasNext requests the TaskContext to kill the task if interrupted (which simply throws a TaskKilledException that in turn breaks the task execution).

In the end, hasNext requests the delegate Iterator for hasNext.
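
A minimal sketch of hasNext (and of the next delegation described in the next section), assuming only the two constructor arguments listed above; an approximation rather than the verbatim source (killTaskIfInterrupted is private[spark], so the snippet only compiles inside the org.apache.spark namespace):

import org.apache.spark.TaskContext\n\nclass InterruptibleIterator[+T](val context: TaskContext, val delegate: Iterator[T])\n  extends Iterator[T] {\n\n  // Throws a TaskKilledException (and so aborts the task) if the task was interrupted\n  def hasNext: Boolean = {\n    context.killTaskIfInterrupted()\n    delegate.hasNext\n  }\n\n  def next(): T = delegate.next()\n}\n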

next Method

                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"InterruptibleIterator/#source-scala_1","title":"[source, scala]","text":""},{"location":"InterruptibleIterator/#next-t","title":"next(): T","text":"

next is part of the Iterator contract (https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@next():A) to produce the next element of this iterator.

next simply requests the delegate Iterator for the next element."},{"location":"ListenerBus/","title":"ListenerBus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                            ListenerBus is an abstraction of event buses that can notify listeners about scheduling events.

                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"ListenerBus/#contract","title":"Contract","text":""},{"location":"ListenerBus/#notifying-listener-about-event","title":"Notifying Listener about Event
                                                                                                                                                                                                                                                                                                                                                                                                                                            doPostEvent(\n  listener: L,\n  event: E): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when ListenerBus is requested to postToAll

                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"ListenerBus/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutionListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExternalCatalogWithListener
                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                            • StreamingListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                            • StreamingQueryListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"ListenerBus/#posting-event-to-all-listeners","title":"Posting Event To All Listeners
                                                                                                                                                                                                                                                                                                                                                                                                                                            postToAll(\n  event: E): Unit\n

postToAll notifies every registered listener about the event, one listener at a time, using doPostEvent (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                            postToAll\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • AsyncEventQueue is requested to dispatch an event
                                                                                                                                                                                                                                                                                                                                                                                                                                            • ReplayListenerBus is requested to replay events
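
A hedged sketch of the notification loop (simplified; the real trait also tracks per-listener timing metrics):

import java.util.concurrent.CopyOnWriteArrayList\nimport scala.util.control.NonFatal\n\ntrait SketchListenerBus[L <: AnyRef, E] {\n  private val listeners = new CopyOnWriteArrayList[L]\n\n  def addListener(listener: L): Unit = listeners.add(listener)\n\n  protected def doPostEvent(listener: L, event: E): Unit\n\n  // Notify every registered listener, never letting one failing listener stop the others\n  def postToAll(event: E): Unit = {\n    val iter = listeners.iterator()\n    while (iter.hasNext) {\n      val listener = iter.next()\n      try doPostEvent(listener, event)\n      catch { case NonFatal(e) => Console.err.println(s\"Listener threw an exception: $e\") }\n    }\n  }\n}\n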
                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"ListenerBus/#registering-listener","title":"Registering Listener
                                                                                                                                                                                                                                                                                                                                                                                                                                            addListener(\n  listener: L): Unit\n

addListener registers the given listener (so it will be notified about events posted to this event bus).

                                                                                                                                                                                                                                                                                                                                                                                                                                            addListener\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveListenerBus is requested to addToQueue
                                                                                                                                                                                                                                                                                                                                                                                                                                            • EventLogFileCompactor is requested to initializeBuilders
                                                                                                                                                                                                                                                                                                                                                                                                                                            • FsHistoryProvider is requested to doMergeApplicationListing and rebuildAppStore
                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"OutputCommitCoordinator/","title":"OutputCommitCoordinator","text":"

From the scaladoc (OutputCommitCoordinator is a private[spark] class, so the scaladoc is only available in the sources):

                                                                                                                                                                                                                                                                                                                                                                                                                                            Authority that decides whether tasks can commit output to HDFS. Uses a \"first committer wins\" policy.

                                                                                                                                                                                                                                                                                                                                                                                                                                            OutputCommitCoordinator is instantiated in both the drivers and executors. On executors, it is configured with a reference to the driver's OutputCommitCoordinatorEndpoint, so requests to commit output will be forwarded to the driver's OutputCommitCoordinator.

                                                                                                                                                                                                                                                                                                                                                                                                                                            This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull requests) for an extensive design discussion.

                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"OutputCommitCoordinator/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                            OutputCommitCoordinator takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                            • isDriver flag

                                                                                                                                                                                                                                                                                                                                                                                                                                              OutputCommitCoordinator is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkEnv utility is used to create a SparkEnv on the driver
                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"OutputCommitCoordinator/#outputcommitcoordinator-rpc-endpoint","title":"OutputCommitCoordinator RPC Endpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                              coordinatorRef: Option[RpcEndpointRef]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              OutputCommitCoordinator is registered as OutputCommitCoordinator (with OutputCommitCoordinatorEndpoint RPC Endpoint) in the RPC Environment on the driver (when SparkEnv utility is used to create \"base\" SparkEnv). Executors have an RpcEndpointRef to the endpoint on the driver.

coordinatorRef is used by executors to post an AskPermissionToCommitOutput message to the OutputCommitCoordinator on the driver (in canCommit).

                                                                                                                                                                                                                                                                                                                                                                                                                                              coordinatorRef is used to stop the OutputCommitCoordinator on the driver (when stop).
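
A hedged sketch of that registration (the registerOrLookupEndpoint helper mirrors what SparkEnv uses internally - it registers the endpoint on the driver and only looks it up on executors - but treat the exact shape as an assumption):

// Driver: register the endpoint; executors: obtain an RpcEndpointRef to the driver's endpoint\nval outputCommitCoordinatorRef = registerOrLookupEndpoint(\"OutputCommitCoordinator\",\n  new OutputCommitCoordinatorEndpoint(rpcEnv, outputCommitCoordinator))\noutputCommitCoordinator.coordinatorRef = Some(outputCommitCoordinatorRef)\n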

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"OutputCommitCoordinator/#cancommit","title":"canCommit
                                                                                                                                                                                                                                                                                                                                                                                                                                              canCommit(\n  stage: Int,\n  stageAttempt: Int,\n  partition: Int,\n  attemptNumber: Int): Boolean\n

canCommit creates an AskPermissionToCommitOutput message and sends it to the OutputCommitCoordinator RPC Endpoint, waiting for a true or false answer (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                              canCommit\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkHadoopMapRedUtil is requested to commitTask (with spark.hadoop.outputCommitCoordination.enabled configuration property enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • DataWritingSparkTask (Spark SQL) utility is used to run
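
A hedged sketch of canCommit as described above (simplified error handling; an approximation of the actual method):

def canCommit(stage: Int, stageAttempt: Int, partition: Int, attemptNumber: Int): Boolean = {\n  val msg = AskPermissionToCommitOutput(stage, stageAttempt, partition, attemptNumber)\n  coordinatorRef match {\n    case Some(endpointRef) =>\n      // Ask the driver-side endpoint and wait for the Boolean answer\n      ThreadUtils.awaitResult(endpointRef.ask[Boolean](msg), RpcUtils.askRpcTimeout(conf).duration)\n    case None =>\n      false // no endpoint reference (e.g. already stopped); simplified\n  }\n}\n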
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"OutputCommitCoordinator/#handleaskpermissiontocommit","title":"handleAskPermissionToCommit
                                                                                                                                                                                                                                                                                                                                                                                                                                              handleAskPermissionToCommit(\n  stage: Int,\n  stageAttempt: Int,\n  partition: Int,\n  attemptNumber: Int): Boolean\n

handleAskPermissionToCommit decides whether the given task attempt is allowed to commit its output for the partition, following the \"first committer wins\" policy (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                              handleAskPermissionToCommit\u00a0is used when:

• OutputCommitCoordinatorEndpoint is requested to handle an AskPermissionToCommitOutput message (that happens after it was sent out in canCommit)
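
A hedged sketch of the first-committer-wins decision (the authorizedCommitters registry below is purely illustrative; the actual method also rejects task attempts that are already known to have failed):

// Hypothetical registry for this sketch: the (stageAttempt, attemptNumber) already\n// authorized to commit a given (stage, partition), if any\nauthorizedCommitters.get((stage, partition)) match {\n  case None =>\n    authorizedCommitters((stage, partition)) = (stageAttempt, attemptNumber) // first committer wins\n    true\n  case Some(_) =>\n    false // another attempt already holds the authorization for this partition\n}\n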
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"OutputCommitCoordinator/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.scheduler.OutputCommitCoordinator logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                              log4j.logger.org.apache.spark.scheduler.OutputCommitCoordinator=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/","title":"SparkConf","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf is Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"SparkConf/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • loadDefaults flag
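
For example, a Spark application typically creates a SparkConf with the defaults loaded and overrides a few settings explicitly (the property values below are examples only):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf() // loadDefaults = true\n  .setMaster(\"local[*]\")\n  .setAppName(\"The Internals of Spark Core\")\n  .set(\"spark.executor.memory\", \"2g\")\n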
                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"SparkConf/#loaddefaults-flag","title":"loadDefaults Flag

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf can be given loadDefaults flag when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                              When true, SparkConf loads spark properties (with silent flag disabled) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#getallwithprefix","title":"getAllWithPrefix
                                                                                                                                                                                                                                                                                                                                                                                                                                              getAllWithPrefix(\n  prefix: String): Array[(String, String)]\n

getAllWithPrefix collects the configuration properties (from getAll) whose keys start with the given prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, getAllWithPrefix removes the given prefix from the keys.
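
A minimal sketch of those two steps (an approximation of the actual method):

def getAllWithPrefix(prefix: String): Array[(String, String)] =\n  getAll\n    .filter { case (key, _) => key.startsWith(prefix) }                  // keep keys with the prefix\n    .map { case (key, value) => (key.substring(prefix.length), value) } // drop the prefix\n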

                                                                                                                                                                                                                                                                                                                                                                                                                                              getAllWithPrefix is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf is requested to getExecutorEnv (spark.executorEnv. prefix), fillMissingMagicCommitterConfsIfNeeded (spark.hadoop.fs.s3a.bucket. prefix)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorPluginContainer is requested for the executorPlugins (spark.plugins.internal.conf. prefix)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceUtils is requested to parseResourceRequest, listResourceIds, addTaskResourceRequests, parseResourceRequirements
                                                                                                                                                                                                                                                                                                                                                                                                                                              • SortShuffleManager is requested to loadShuffleExecutorComponents (spark.shuffle.plugin.__config__. prefix)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • ServerInfo is requested to addFilters
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#loading-spark-properties","title":"Loading Spark Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                              loadFromSystemProperties(\n  silent: Boolean): SparkConf\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              loadFromSystemProperties records all the spark.-prefixed system properties in this SparkConf.
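
A hedged sketch of that loading (an approximation; the real method reads the JVM system properties through a Spark utility rather than sys.props directly):

private[spark] def loadFromSystemProperties(silent: Boolean): SparkConf = {\n  // Copy every spark.-prefixed system property into this SparkConf\n  for ((key, value) <- sys.props if key.startsWith(\"spark.\")) {\n    set(key, value, silent)\n  }\n  this\n}\n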

                                                                                                                                                                                                                                                                                                                                                                                                                                              Silently loading system properties

                                                                                                                                                                                                                                                                                                                                                                                                                                              Loading system properties silently is possible using the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                              new SparkConf(loadDefaults = false).loadFromSystemProperties(silent = true)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              loadFromSystemProperties is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf is created (with loadDefaults enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkHadoopUtil is created
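As a quick illustration (a minimal sketch; the property name and value are arbitrary), any spark.-prefixed JVM system property ends up in a default-constructed SparkConf:

System.setProperty(\"spark.app.name\", \"SysPropsDemo\")\n// loadDefaults is true by default, so loadFromSystemProperties is invoked\nval conf = new org.apache.spark.SparkConf()\nassert(conf.get(\"spark.app.name\") == \"SysPropsDemo\")\n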
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#executor-settings","title":"Executor Settings

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf uses spark.executorEnv. prefix for executor settings.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#getexecutorenv","title":"getExecutorEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                              getExecutorEnv: Seq[(String, String)]\n

getExecutorEnv gets all the settings with the spark.executorEnv. prefix (with the prefix stripped from the keys).

                                                                                                                                                                                                                                                                                                                                                                                                                                              getExecutorEnv is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is created (and requested for executorEnvs)
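A minimal sketch (the environment variable name and value are made up) of what getExecutorEnv returns:

val conf = new org.apache.spark.SparkConf(loadDefaults = false)\n  .set(\"spark.executorEnv.JAVA_OPTS\", \"-Xss4m\")\n// the spark.executorEnv. prefix is stripped from the keys\nassert(conf.getExecutorEnv.toMap == Map(\"JAVA_OPTS\" -> \"-Xss4m\"))\n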
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#setexecutorenv","title":"setExecutorEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                              setExecutorEnv(\n  variables: Array[(String, String)]): SparkConf\nsetExecutorEnv(\n  variables: Seq[(String, String)]): SparkConf\nsetExecutorEnv(\n  variable: String, value: String): SparkConf\n

setExecutorEnv sets the given (key, value) variables with the spark.executorEnv. prefix added to the keys.

                                                                                                                                                                                                                                                                                                                                                                                                                                              setExecutorEnv is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to updatedConf
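For illustration (again with a made-up environment variable), setExecutorEnv is equivalent to setting a spark.executorEnv.-prefixed property:

val conf = new org.apache.spark.SparkConf(loadDefaults = false)\n  .setExecutorEnv(\"JAVA_OPTS\", \"-Xss4m\")\nassert(conf.get(\"spark.executorEnv.JAVA_OPTS\") == \"-Xss4m\")\n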
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkConf/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.SparkConf logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                              log4j.logger.org.apache.spark.SparkConf=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/","title":"Inside Creating SparkContext","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                              This document describes the internals of what happens when a new SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                              import org.apache.spark.{SparkConf, SparkContext}\n\n// 1. Create Spark configuration\nval conf = new SparkConf()\n  .setAppName(\"SparkMe Application\")\n  .setMaster(\"local[*]\")\n\n// 2. Create Spark context\nval sc = new SparkContext(conf)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"SparkContext-creating-instance-internals/#creationsite","title":"creationSite
                                                                                                                                                                                                                                                                                                                                                                                                                                              creationSite: CallSite\n

SparkContext determines the call site (of the code that instantiates it).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#assertondriver","title":"assertOnDriver

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#markpartiallyconstructed","title":"markPartiallyConstructed

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#starttime","title":"startTime
                                                                                                                                                                                                                                                                                                                                                                                                                                              startTime: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext records the current time (in ms).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#stopped","title":"stopped
                                                                                                                                                                                                                                                                                                                                                                                                                                              stopped: AtomicBoolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext initializes stopped flag to false.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#printing-out-spark-version","title":"Printing Out Spark Version

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Running Spark version [SPARK_VERSION]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkuser","title":"sparkUser
                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkUser: String\n

SparkContext determines the Spark user.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkconf","title":"SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                              _conf: SparkConf\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext clones the SparkConf and requests it to validateSettings.
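A small sketch of what the cloning means in practice (spark.foo is a hypothetical property used only for the demonstration): changes made to the original SparkConf after the SparkContext is created are not visible to it.

import org.apache.spark.{SparkConf, SparkContext}\n\nval conf = new SparkConf().setMaster(\"local[*]\").setAppName(\"CloneDemo\")\nval sc = new SparkContext(conf)\nconf.set(\"spark.foo\", \"bar\") // too late; sc works on a clone\nassert(!sc.getConf.contains(\"spark.foo\"))\nsc.stop()\n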

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#enforcing-mandatory-configuration-properties","title":"Enforcing Mandatory Configuration Properties

SparkContext asserts that spark.master and spark.app.name are defined (in the SparkConf) and throws a SparkException otherwise:

                                                                                                                                                                                                                                                                                                                                                                                                                                              A master URL must be set in your configuration\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              An application name must be set in your configuration\n
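A hypothetical REPL-style sketch of the assertion in action (only the missing-master case is shown):

import org.apache.spark.{SparkConf, SparkContext, SparkException}\n\nval conf = new SparkConf(loadDefaults = false).setAppName(\"NoMasterDemo\")\ntry {\n  new SparkContext(conf)\n} catch {\n  case e: SparkException => println(e.getMessage)\n  // A master URL must be set in your configuration\n}\n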
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#driverlogger","title":"DriverLogger
                                                                                                                                                                                                                                                                                                                                                                                                                                              _driverLogger: Option[DriverLogger]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a DriverLogger.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#resourceinformation","title":"ResourceInformation
                                                                                                                                                                                                                                                                                                                                                                                                                                              _resources: Map[String, ResourceInformation]\n

SparkContext uses the spark.driver.resourcesFile configuration property to discover driver resources and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                              ==============================================================\nResources for [componentName]:\n[resources]\n==============================================================\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#submitted-application","title":"Submitted Application

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext prints out the following INFO message to the logs (with the value of spark.app.name configuration property):

                                                                                                                                                                                                                                                                                                                                                                                                                                              Submitted application: [appName]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#spark-on-yarn-and-sparkyarnappid","title":"Spark on YARN and spark.yarn.app.id

For Spark on YARN in cluster deploy mode, SparkContext checks whether the spark.yarn.app.id configuration property is defined. A SparkException is thrown if it is not:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#displaying-spark-configuration","title":"Displaying Spark Configuration

                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.logConf configuration property enabled, SparkContext prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark configuration:\n[conf.toDebugString]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf.toDebugString is used very early in the initialization process and other settings configured afterwards are not included. Use SparkContext.getConf.toDebugString once SparkContext is initialized.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-configuration-properties","title":"Setting Configuration Properties
SparkContext sets the following configuration properties (see the sketch after this list):

• spark.driver.host to the current value of the property (to override the default)
                                                                                                                                                                                                                                                                                                                                                                                                                                              • spark.driver.port to 0 unless defined already
                                                                                                                                                                                                                                                                                                                                                                                                                                              • spark.executor.id to driver
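A minimal sketch that shows one of the properties above after SparkContext is up (the app name and master are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}\n\nval sc = new SparkContext(new SparkConf().setMaster(\"local[*]\").setAppName(\"ConfDemo\"))\nassert(sc.getConf.get(\"spark.executor.id\") == \"driver\")\nsc.stop()\n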
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#user-defined-jar-files","title":"User-Defined Jar Files
                                                                                                                                                                                                                                                                                                                                                                                                                                              _jars: Seq[String]\n

SparkContext sets _jars to the value of the spark.jars configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#user-defined-files","title":"User-Defined Files
                                                                                                                                                                                                                                                                                                                                                                                                                                              _files: Seq[String]\n

SparkContext sets _files to the value of the spark.files configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkeventlogdir","title":"spark.eventLog.dir
                                                                                                                                                                                                                                                                                                                                                                                                                                              _eventLogDir: Option[URI]\n

If event logging is enabled (i.e. the spark.eventLog.enabled flag is true), the internal _eventLogDir field is set to the value of the spark.eventLog.dir setting or the default value /tmp/spark-events.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkeventlogcompress","title":"spark.eventLog.compress
                                                                                                                                                                                                                                                                                                                                                                                                                                              _eventLogCodec: Option[String]\n

Also, if spark.eventLog.compress is enabled (it is not by default), the short name of the CompressionCodec is assigned to _eventLogCodec. The codec is configured by spark.io.compression.codec (default: lz4).
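For illustration, a SparkConf sketch that enables both of the above (the directory is just an example; /tmp/spark-events is also the default):

val conf = new org.apache.spark.SparkConf()\n  .set(\"spark.eventLog.enabled\", \"true\")\n  .set(\"spark.eventLog.dir\", \"/tmp/spark-events\")\n  .set(\"spark.eventLog.compress\", \"true\")\n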

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-livelistenerbus","title":"Creating LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                              _listenerBus: LiveListenerBus\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a LiveListenerBus.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-appstatusstore-and-appstatussource","title":"Creating AppStatusStore (and AppStatusSource)
                                                                                                                                                                                                                                                                                                                                                                                                                                              _statusStore: AppStatusStore\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates an in-memory store (with an optional AppStatusSource if enabled) and requests the LiveListenerBus to register the AppStatusListener with the status queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                              The AppStatusStore is available using the statusStore property of the SparkContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkenv","title":"Creating SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                              _env: SparkEnv\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a SparkEnv and requests SparkEnv to use the instance as the default SparkEnv.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkreplclassuri","title":"spark.repl.class.uri

                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.repl.class.outputDir configuration property defined, SparkContext sets spark.repl.class.uri configuration property to be...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkstatustracker","title":"Creating SparkStatusTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                              _statusTracker: SparkStatusTracker\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a SparkStatusTracker (with itself and the AppStatusStore).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-consoleprogressbar","title":"Creating ConsoleProgressBar
                                                                                                                                                                                                                                                                                                                                                                                                                                              _progressBar: Option[ConsoleProgressBar]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a ConsoleProgressBar only when spark.ui.showConsoleProgress configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-sparkui","title":"Creating SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                              _ui: Option[SparkUI]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a SparkUI only when spark.ui.enabled configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests the SparkUI to bind.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#hadoop-configuration","title":"Hadoop Configuration
                                                                                                                                                                                                                                                                                                                                                                                                                                              _hadoopConfiguration: Configuration\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a new Hadoop Configuration.
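As a usage note, the Configuration is later available as SparkContext.hadoopConfiguration (a sketch, assuming an active SparkContext called sc):

// with no external Hadoop configuration, fs.defaultFS resolves to the local file system\nval fsDefault = sc.hadoopConfiguration.get(\"fs.defaultFS\") // e.g. file:///\n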

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-user-defined-jar-files","title":"Adding User-Defined Jar Files

                                                                                                                                                                                                                                                                                                                                                                                                                                              If there are jars given through the SparkContext constructor, they are added using addJar.
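A hedged sketch of the user-facing side (paths are illustrative): jars can be listed up front in spark.jars or added later with addJar, and both end up being distributed to executors.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("jars-demo")
  .set("spark.jars", "/tmp/my-udfs.jar")   // illustrative path

val sc = SparkContext.getOrCreate(conf)
sc.addJar("/tmp/extra-lib.jar")            // illustrative path, added after startup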

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-user-defined-files","title":"Adding User-Defined Files

SparkContext adds the files specified by the spark.files configuration property.
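For illustration, assuming an existing SparkContext sc and a made-up file path, a file distributed this way (via spark.files or addFile) is resolved locally with SparkFiles.get:

import org.apache.spark.SparkFiles

sc.addFile("/tmp/lookup.csv")                 // or list it in spark.files up front
val localPath = SparkFiles.get("lookup.csv")  // local copy of the distributed file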

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#_executormemory","title":"_executorMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                              _executorMemory: Int\n

SparkContext determines the amount of memory to allocate to each executor. It is the value of the spark.executor.memory setting, or the SPARK_EXECUTOR_MEMORY environment variable (or the currently-deprecated SPARK_MEM), and defaults to 1024 (MiB).

_executorMemory is later available as sc.executorMemory and is used for LOCAL_CLUSTER_REGEX, in SparkDeploySchedulerBackend, MesosSchedulerBackend and CoarseMesosSchedulerBackend, and to set executorEnvs(\"SPARK_EXECUTOR_MEMORY\").
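A minimal configuration sketch (the value is illustrative): setting spark.executor.memory explicitly takes precedence over the environment variables above.

import org.apache.spark.SparkConf

// Request 2 GiB per executor; if neither this property nor the environment
// variables are set, the internal default of 1024 (MiB) applies.
val conf = new SparkConf().set("spark.executor.memory", "2g")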

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#spark_prepend_classes-environment-variable","title":"SPARK_PREPEND_CLASSES Environment Variable

                                                                                                                                                                                                                                                                                                                                                                                                                                              The value of SPARK_PREPEND_CLASSES environment variable is included in executorEnvs.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#for-mesos-schedulerbackend-only","title":"For Mesos SchedulerBackend Only

The Mesos scheduler backend's configuration is included in executorEnvs, i.e. SPARK_EXECUTOR_MEMORY, _conf.getExecutorEnv, and SPARK_USER.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#shuffledrivercomponents","title":"ShuffleDriverComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                              _shuffleDriverComponents: ShuffleDriverComponents\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-heartbeatreceiver","title":"Registering HeartbeatReceiver

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext registers HeartbeatReceiver RPC endpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#plugincontainer","title":"PluginContainer
                                                                                                                                                                                                                                                                                                                                                                                                                                              _plugins: Option[PluginContainer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a PluginContainer (with itself and the _resources).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#creating-schedulerbackend-and-taskscheduler","title":"Creating SchedulerBackend and TaskScheduler

The SparkContext object is requested to create the SchedulerBackend with the TaskScheduler (for the given master URL) and the result becomes the internal _schedulerBackend and _taskScheduler.

The DAGScheduler is created (as _dagScheduler).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sending-blocking-taskschedulerisset","title":"Sending Blocking TaskSchedulerIsSet

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext sends a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC endpoint (to inform that the TaskScheduler is now available).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#executormetricssource","title":"ExecutorMetricsSource

SparkContext creates an ExecutorMetricsSource when the spark.metrics.executorMetricsSource.enabled configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#heartbeater","title":"Heartbeater

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates a Heartbeater and starts it.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-taskscheduler","title":"Starting TaskScheduler

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests the TaskScheduler to start.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-spark-applications-and-execution-attempts-ids","title":"Setting Spark Application's and Execution Attempt's IDs

SparkContext sets the internal _applicationId and _applicationAttemptId fields (using the applicationId and applicationAttemptId methods of the TaskScheduler contract).

NOTE: SparkContext requests the TaskScheduler for the unique identifier of the Spark application (which is currently only implemented by TaskSchedulerImpl, which uses the SchedulerBackend to request the identifier).

NOTE: The unique identifier of a Spark application is used to initialize the SparkUI and the BlockManager.

NOTE: _applicationAttemptId is used when SparkContext is requested for the unique identifier of the execution attempt of a Spark application and when EventLoggingListener is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#setting-sparkappid-spark-property-in-sparkconf","title":"Setting spark.app.id Spark Property in SparkConf

SparkContext sets the spark.app.id property to the unique identifier of the Spark application (_applicationId) and, if the web UI is enabled, passes it on to the SparkUI.
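The identifier is visible to user code afterwards. A small sketch, assuming an existing SparkContext sc (the comments describe typical values, not guaranteed formats):

val appId   = sc.applicationId           // e.g. local-<timestamp> in local mode, application_... on YARN
val attempt = sc.applicationAttemptId    // Option[String]; typically defined on YARN only
println(sc.getConf.get("spark.app.id"))  // same value as appId once initialization is done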

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#sparkuiproxybase","title":"spark.ui.proxyBase","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-sparkui","title":"Initializing SparkUI

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests the SparkUI (if defined) to setAppId with the _applicationId.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-blockmanager","title":"Initializing BlockManager

The BlockManager (for the driver) is initialized (with _applicationId).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-metricssystem","title":"Starting MetricsSystem

SparkContext requests the MetricsSystem to start (with the value of the spark.metrics.staticSources.enabled configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

SparkContext starts the MetricsSystem only after spark.app.id has been set, as the MetricsSystem uses it to build unique identifiers for metrics sources.","text":""},{"location":"SparkContext-creating-instance-internals/#attaching-servlet-handlers-to-web-ui","title":"Attaching Servlet Handlers to web UI

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests the MetricsSystem for servlet handlers and requests the SparkUI to attach them.
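The attached handlers include the default MetricsServlet, so (with the web UI enabled) driver metrics become available as JSON under the web UI. Additional sinks can be configured with spark.metrics.conf.*-prefixed properties; a hedged sketch (sink name and values are illustrative):

import org.apache.spark.SparkConf

// Add a console sink next to the default metrics servlet.
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.console.class",
       "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.*.sink.console.period", "10")
  .set("spark.metrics.conf.*.sink.console.unit", "seconds")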

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#starting-eventlogginglistener-with-event-log-enabled","title":"Starting EventLoggingListener (with Event Log Enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                              _eventLogger: Option[EventLoggingListener]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.eventLog.enabled configuration property enabled, SparkContext creates an EventLoggingListener and requests it to start.

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests the LiveListenerBus to add the EventLoggingListener to eventLog event queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.eventLog.enabled disabled, _eventLogger is None (undefined).
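A configuration sketch (the directory is illustrative) under which SparkContext creates and starts the EventLoggingListener:

import org.apache.spark.SparkConf

// Enable event logging so a History Server can replay the application.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-events")   // illustrative location
  .set("spark.eventLog.compress", "true")              // optional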

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#contextcleaner","title":"ContextCleaner
                                                                                                                                                                                                                                                                                                                                                                                                                                              _cleaner: Option[ContextCleaner]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.cleaner.referenceTracking configuration property enabled, SparkContext creates a ContextCleaner (with itself and the _shuffleDriverComponents).

SparkContext requests the ContextCleaner to start.
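spark.cleaner.referenceTracking is enabled by default; a configuration sketch for turning it off (in which case no ContextCleaner is created):

import org.apache.spark.SparkConf

// With reference tracking disabled, _cleaner stays None.
val conf = new SparkConf().set("spark.cleaner.referenceTracking", "false")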

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#executorallocationmanager","title":"ExecutorAllocationManager
                                                                                                                                                                                                                                                                                                                                                                                                                                              _executorAllocationManager: Option[ExecutorAllocationManager]\n

SparkContext initializes the _executorAllocationManager internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext creates an ExecutorAllocationManager when:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • Dynamic Allocation of Executors is enabled (based on spark.dynamicAllocation.enabled configuration property and the master URL)

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SchedulerBackend is an ExecutorAllocationClient

                                                                                                                                                                                                                                                                                                                                                                                                                                              The ExecutorAllocationManager is requested to start.
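A configuration sketch under which the conditions above typically hold on a cluster manager (the executor counts are illustrative values):

import org.apache.spark.SparkConf

// Enable Dynamic Allocation of Executors.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  // Either the external shuffle service or shuffle tracking is typically
  // also needed so executors can be released safely.
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")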

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-user-defined-sparklisteners","title":"Registering User-Defined SparkListeners

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext registers user-defined listeners and starts SparkListenerEvent event delivery to the listeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#postenvironmentupdate","title":"postEnvironmentUpdate

postEnvironmentUpdate is called to post a SparkListenerEnvironmentUpdate message on the LiveListenerBus with information about the Task Scheduler's scheduling mode, added jar and file paths, and other environmental details.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#postapplicationstart","title":"postApplicationStart

A SparkListenerApplicationStart message is posted to the LiveListenerBus (using the internal postApplicationStart method).

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#poststarthook","title":"postStartHook

The TaskScheduler is notified that SparkContext is almost fully initialized.

NOTE: TaskScheduler.postStartHook does nothing by default, but custom implementations offer more advanced features, e.g. TaskSchedulerImpl blocks the current thread until the SchedulerBackend is ready. There is also YarnClusterScheduler for Spark on YARN in cluster deploy mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#registering-metrics-sources","title":"Registering Metrics Sources

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext requests MetricsSystem to register metrics sources for the following services:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorAllocationManager
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#adding-shutdown-hook","title":"Adding Shutdown Hook

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext adds a shutdown hook (using ShutdownHookManager.addShutdownHook()).

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Adding shutdown hook\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME ShutdownHookManager.addShutdownHook()

Any non-fatal exception during initialization leads to the termination of the SparkContext instance.

                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME What does NonFatal represent in Scala?
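For reference, NonFatal (scala.util.control.NonFatal) is an extractor that matches every Throwable except those considered fatal to the JVM or to control flow (VirtualMachineError, ThreadDeath, InterruptedException, LinkageError, ControlThrowable). A sketch of the pattern (riskyInit is a hypothetical placeholder):

import scala.util.control.NonFatal

// Hypothetical placeholder for some initialization step.
def riskyInit(): Unit = ???

try riskyInit()
catch {
  case NonFatal(e) =>
    // Matches "ordinary" exceptions only; fatal errors such as
    // OutOfMemoryError (a VirtualMachineError) fall through.
    println(s"Initialization failed: ${e.getMessage}")
    throw e
}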

                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME Finish me

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#initializing-nextshuffleid-and-nextrddid-internal-counters","title":"Initializing nextShuffleId and nextRddId Internal Counters

                                                                                                                                                                                                                                                                                                                                                                                                                                              nextShuffleId and nextRddId start with 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME Where are nextShuffleId and nextRddId used?

                                                                                                                                                                                                                                                                                                                                                                                                                                              A new instance of Spark context is created and ready for operation.

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#loading-external-cluster-manager-for-url-getclustermanager-method","title":"Loading External Cluster Manager for URL (getClusterManager method)
                                                                                                                                                                                                                                                                                                                                                                                                                                              getClusterManager(\n  url: String): Option[ExternalClusterManager]\n

getClusterManager loads the ExternalClusterManager that can handle the input url (using ExternalClusterManager.canCreate).

                                                                                                                                                                                                                                                                                                                                                                                                                                              If there are two or more external cluster managers that could handle url, a SparkException is thrown:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Multiple Cluster Managers ([serviceLoaders]) registered for the url [url].\n

NOTE: getClusterManager uses Java's ServiceLoader.load method (https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html#load-java.lang.Class-java.lang.ClassLoader-).

NOTE: getClusterManager is used to find a cluster manager for a master URL when creating a SchedulerBackend and a TaskScheduler for the driver.
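ExternalClusterManager is an internal Spark contract, so the following is only a generic sketch of the ServiceLoader-based lookup pattern described above (the UrlHandler trait and findHandler method are made up for illustration, not Spark code):

import java.util.ServiceLoader
import scala.collection.mutable.ListBuffer

// Made-up service interface standing in for ExternalClusterManager.
trait UrlHandler {
  def canHandle(url: String): Boolean
}

def findHandler(url: String): Option[UrlHandler] = {
  // ServiceLoader discovers implementations registered under
  // META-INF/services/<fully.qualified.UrlHandler>.
  val candidates = ListBuffer.empty[UrlHandler]
  val it = ServiceLoader.load(classOf[UrlHandler]).iterator()
  while (it.hasNext) {
    val handler = it.next()
    if (handler.canHandle(url)) candidates += handler
  }
  candidates.toList match {
    case Nil            => None
    case handler :: Nil => Some(handler)
    case many           =>
      throw new IllegalStateException(
        s"Multiple handlers (${many.map(_.getClass.getName).mkString(", ")}) registered for the url $url")
  }
}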

                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext-creating-instance-internals/#setupandstartlistenerbus","title":"setupAndStartListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                              setupAndStartListenerBus(): Unit\n

setupAndStartListenerBus is an internal method that reads the spark.extraListeners configuration property from the current SparkConf to create and register SparkListenerInterface listeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                              It expects that the class name represents a SparkListenerInterface listener with one of the following constructors (in this order):

• a single-argument constructor that accepts SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                              • a zero-argument constructor

setupAndStartListenerBus registers every listener class.

                                                                                                                                                                                                                                                                                                                                                                                                                                              You should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                              INFO Registered listener [className]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              It scheduler:LiveListenerBus.md#start[starts LiveListenerBus] and records it in the internal _listenerBusStarted.

When neither a single-argument SparkConf constructor nor a zero-argument constructor can be found for a class name in configuration-properties.md#spark.extraListeners[spark.extraListeners] configuration property, a SparkException is thrown with the message:

                                                                                                                                                                                                                                                                                                                                                                                                                                              [className] did not have a zero-argument constructor or a single-argument constructor that accepts SparkConf. Note: if the class is defined inside of another Scala class, then its constructors may accept an implicit parameter that references the enclosing class; in this case, you must define the listener as a top-level class in order to prevent this extra parameter from breaking Spark's ability to find a valid constructor.\n

Any exception while registering a SparkListenerInterface listener stops the SparkContext, and a SparkException is thrown (with the source exception as its cause) with the following message:

                                                                                                                                                                                                                                                                                                                                                                                                                                              Exception when registering SparkListener\n

                                                                                                                                                                                                                                                                                                                                                                                                                                              Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                              Set INFO logging level for org.apache.spark.SparkContext logger to see the extra listeners being registered.

                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered listener pl.japila.spark.CustomSparkListener\n
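For illustration, a minimal SparkListenerInterface listener that satisfies the constructor lookup described above could look as follows. This is only a sketch: the package, class and counter names are made up.

```scala
package com.example.listeners // hypothetical package

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// A top-level class with a single-argument SparkConf constructor
// (the first constructor shape setupAndStartListenerBus looks for;
// a zero-argument constructor would also work).
class JobCountingListener(conf: SparkConf) extends SparkListener {
  private var jobCount = 0

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    jobCount += 1
    println(s"[${conf.get("spark.app.name", "unknown")}] job ${jobStart.jobId} started (total: $jobCount)")
  }
}
```

Such a listener would be registered by adding its fully-qualified class name to spark.extraListeners, e.g. `--conf spark.extraListeners=com.example.listeners.JobCountingListener` on spark-submit.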
                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"SparkContext/","title":"SparkContext","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext is the entry point to all of the components of Apache Spark (execution engine) and so the heart of a Spark application. In fact, you can consider an application a Spark application only when it uses a SparkContext (directly or indirectly).

                                                                                                                                                                                                                                                                                                                                                                                                                                              Important

                                                                                                                                                                                                                                                                                                                                                                                                                                              There should be one active SparkContext per JVM and Spark developers should use SparkContext.getOrCreate utility for sharing it (e.g. across threads).

                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"SparkContext/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkContext is created (directly or indirectly using getOrCreate utility).

                                                                                                                                                                                                                                                                                                                                                                                                                                                While being created, SparkContext sets up core services and establishes a connection to a cluster manager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"SparkContext/#checkpoint-directory","title":"Checkpoint Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkContext defines checkpointDir internal registry for the path to a checkpoint directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                checkpointDir is undefined (None) when SparkContext is created and is set using setCheckpointDir.

                                                                                                                                                                                                                                                                                                                                                                                                                                                checkpointDir is required for Reliable Checkpointing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                checkpointDir is available using getCheckpointDir.

                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"SparkContext/#getcheckpointdir","title":"getCheckpointDir
                                                                                                                                                                                                                                                                                                                                                                                                                                                getCheckpointDir: Option[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                getCheckpointDir returns the checkpointDir.

                                                                                                                                                                                                                                                                                                                                                                                                                                                getCheckpointDir is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                • ReliableRDDCheckpointData is requested for the checkpoint path
                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"SparkContext/#submitting-mapstage-for-execution","title":"Submitting MapStage for Execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                submitMapStage[K, V, C](\n  dependency: ShuffleDependency[K, V, C]): SimpleFutureAction[MapOutputStatistics]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                submitMapStage requests the DAGScheduler to submit the given ShuffleDependency for execution (that eventually produces a MapOutputStatistics).

                                                                                                                                                                                                                                                                                                                                                                                                                                                submitMapStage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleExchangeExec (Spark SQL) unary physical operator is executed
                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"SparkContext/#executormetricssource","title":"ExecutorMetricsSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkContext creates an ExecutorMetricsSource when created with spark.metrics.executorMetricsSource.enabled enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkContext requests the ExecutorMetricsSource to register with the MetricsSystem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkContext uses the ExecutorMetricsSource to create the Heartbeater.

                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"SparkContext/#services","title":"Services
                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorAllocationManager (optional)

                                                                                                                                                                                                                                                                                                                                                                                                                                                • SchedulerBackend","text":""},{"location":"SparkContext/#resourceprofilemanager","title":"ResourceProfileManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext creates a ResourceProfileManager when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#resourceprofilemanager_1","title":"resourceProfileManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceProfileManager: ResourceProfileManager\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceProfileManager returns the ResourceProfileManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceProfileManager is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • others
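For context, the ResourceProfiles that the manager tracks are typically built with the stage-level scheduling API. The following is only a sketch, not from the source: it assumes an active SparkContext sc, Spark 3.1+, and a cluster manager that allows custom resource profiles; the numbers are arbitrary.

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a custom ResourceProfile with its own executor and task requirements.
val profile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(4).memory("8g"))
  .require(new TaskResourceRequests().cpus(2))
  .build()

// Attaching the profile to an RDD is how custom profiles typically end up
// registered with the SparkContext's ResourceProfileManager.
val withProfile = sc.parallelize(1 to 100).map(_ * 2).withResources(profile)
```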
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#driverlogger","title":"DriverLogger

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext can create a DriverLogger when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext requests the DriverLogger to startSync in postApplicationStart.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#appstatussource","title":"AppStatusSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext can create an AppStatusSource when created (based on the spark.metrics.appStatusSource.enabled configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext uses the AppStatusSource to create the AppStatusStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  If configured, SparkContext registers the AppStatusSource with the MetricsSystem.
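A minimal sketch of turning the source on (the property name comes from the paragraph above; the app name and master are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable AppStatusSource metrics before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("app-status-metrics-demo")
  .setMaster("local[*]")
  .set("spark.metrics.appStatusSource.enabled", "true")

val sc = SparkContext.getOrCreate(conf)
```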

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#appstatusstore","title":"AppStatusStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext creates an AppStatusStore when created (with itself and the AppStatusSource).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext requests AppStatusStore for the AppStatusListener and requests the LiveListenerBus to add it to the application status queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext uses the AppStatusStore to create the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkStatusTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkUI

The AppStatusStore is requested to close when SparkContext is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#statusstore","title":"statusStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusStore: AppStatusStore\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusStore returns the AppStatusStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusStore is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to getRDDStorageInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ConsoleProgressBar is requested to refresh
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • HiveThriftServer2 is requested to createListenerAndUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SharedState (Spark SQL) is requested for a SQLAppStatusStore and a StreamingQueryStatusListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#sparkstatustracker","title":"SparkStatusTracker

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext creates a SparkStatusTracker when created (with itself and the AppStatusStore).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#statustracker","title":"statusTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusTracker: SparkStatusTracker\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusTracker returns the SparkStatusTracker.
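As an illustration of what the tracker exposes, the sketch below assumes an active SparkContext sc and uses a made-up job group name:

```scala
// Run a job under a job group so it can be looked up afterwards.
sc.setJobGroup("nightly-etl", "Nightly ETL jobs")
sc.parallelize(1 to 1000).count()

// Query the SparkStatusTracker for the jobs of that group.
val tracker = sc.statusTracker
tracker.getJobIdsForGroup("nightly-etl").foreach { jobId =>
  tracker.getJobInfo(jobId).foreach { info =>
    println(s"Job $jobId: ${info.status()}")
  }
}
```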

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#local-properties","title":"Local Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                  localProperties: InheritableThreadLocal[Properties]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext uses an InheritableThreadLocal (Java) of key-value pairs of thread-local properties to pass extra information from a parent thread (on the driver) to child threads.

localProperties is meant to be used by developers through SparkContext.setLocalProperty and SparkContext.getLocalProperty.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Local Properties are available using TaskContext.getLocalProperty.
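A minimal sketch of the round trip (it assumes an active SparkContext sc; the property key myapp.tag is made up):

```scala
import org.apache.spark.TaskContext

// Set a thread-local property on the driver thread.
sc.setLocalProperty("myapp.tag", "nightly-run")

// Jobs submitted from this thread (or its child threads) see the property in their tasks.
val seenByTasks = sc.parallelize(1 to 2, 2)
  .map(_ => TaskContext.get.getLocalProperty("myapp.tag"))
  .collect()

assert(sc.getLocalProperty("myapp.tag") == "nightly-run")
assert(seenByTasks.forall(_ == "nightly-run"))
```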

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Local Properties are available to SparkListeners using the following events:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkListenerJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkListenerStageSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                  localProperties are passed down when SparkContext is requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Running Job (that in turn makes the local properties available to the DAGScheduler to run a job)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Running Approximate Job
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Submitting Job
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Submitting MapStage

                                                                                                                                                                                                                                                                                                                                                                                                                                                  DAGScheduler passes down local properties when scheduling:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleMapTasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ResultTasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSets

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark (Core) defines the following local properties.

| Name | Default Value | Setter |
|------|---------------|--------|
| callSite.long | | |
| callSite.short | | SparkContext.setCallSite |
| spark.job.description | callSite.short | SparkContext.setJobDescription (SparkContext.setJobGroup) |
| spark.job.interruptOnCancel | | SparkContext.setJobGroup |
| spark.jobGroup.id | | SparkContext.setJobGroup |
| spark.scheduler.pool | | |
","text":""},{"location":"SparkContext/#shuffledrivercomponents","title":"ShuffleDriverComponents

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext creates a ShuffleDriverComponents when created.

SparkContext loads the ShuffleDataIO, which is in turn requested for the ShuffleDriverComponents. SparkContext then requests the ShuffleDriverComponents to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  The ShuffleDriverComponents is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleDependency is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext creates the ContextCleaner (if enabled)

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext requests the ShuffleDriverComponents to clean up when stopping.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#static-files","title":"Static Files","text":""},{"location":"SparkContext/#addfile","title":"addFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                  addFile(\n  path: String,\n  recursive: Boolean): Unit\n// recursive = false\naddFile(\n  path: String): Unit\n

First, addFile validates the scheme of the given path. For a path without a scheme, addFile converts it to a canonical form. For a path with the local scheme, addFile prints out the following WARN message to the logs and exits.

File with 'local' scheme is not supported to add to file server, since it is already available on every node.\n
For a path with any other scheme, addFile creates a Hadoop Path from the given path.

addFile validates the URL if the path is an HTTP, HTTPS or FTP URI.

addFile throws a SparkException with the following message if the path is a local directory and the application is not running in local mode:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  addFile does not support local directories when not running local mode.\n

addFile throws a SparkException with the following message if the path is a directory and the recursive flag is not turned on:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Added file $hadoopPath is a directory and recursive is not turned on.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, addFile adds the file to the addedFiles internal registry (with the current timestamp):

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • For new files, addFile prints out the following INFO message to the logs, fetches the file (to the root directory and without using the cache) and postEnvironmentUpdate.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Added file [path] at [key] with timestamp [timestamp]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • For files that were already added, addFile prints out the following WARN message to the logs:

The path [path] has been added already. Overwriting of added paths is not supported in the current version.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  addFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created
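
As a usage illustration (the file path is made up), files added with addFile are later resolved with SparkFiles:

import org.apache.spark.SparkFiles

// Ship a file to every node that runs tasks for this application.
sc.addFile("/tmp/lookup.csv")  // made-up path

// Inside a task (or on the driver, where the file was fetched to the root directory),
// resolve the local copy by its file name.
val localCopy = SparkFiles.get("lookup.csv")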
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#listfiles","title":"listFiles
listFiles(): Seq[String]

listFiles returns the files added so far (the keys of the addedFiles internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#addedfiles-internal-registry","title":"addedFiles Internal Registry
addedFiles: Map[String, Long]

addedFiles is a collection of static files with the timestamp they were added at.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  addedFiles is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to postEnvironmentUpdate and listFiles
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSetManager is created (and resourceOffer)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#files","title":"files
files: Seq[String]

files is a collection of file paths defined by the spark.files configuration property.
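
A hedged sketch of defining the property programmatically (the paths and the application name are made up); the listed files are added while SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark.files demo")                        // made-up name
  .setMaster("local[*]")
  .set("spark.files", "/tmp/lookup.csv,/tmp/extra.txt")  // made-up, comma-separated paths
val sc = new SparkContext(conf)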

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#posting-sparklistenerenvironmentupdate-event","title":"Posting SparkListenerEnvironmentUpdate Event
postEnvironmentUpdate(): Unit

postEnvironmentUpdate posts a SparkListenerEnvironmentUpdate event (with the current environment details)...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                  postEnvironmentUpdate is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to addFile and addJar
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#getorcreate-utility","title":"getOrCreate Utility
getOrCreate(): SparkContext
getOrCreate(
  config: SparkConf): SparkContext

getOrCreate returns the active SparkContext if there is one or creates a new one (with the given SparkConf, if specified).
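
As an illustration (the application name is made up), getOrCreate gives the shared SparkContext:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("getOrCreate demo")  // made-up name
  .setMaster("local[*]")

// Returns the active SparkContext if there is one, or creates a new one with the given SparkConf.
val sc = SparkContext.getOrCreate(conf)

// Subsequent calls (e.g. from another thread) return the very same instance.
assert(SparkContext.getOrCreate() eq sc)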

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#plugincontainer","title":"PluginContainer

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext creates a PluginContainer when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  PluginContainer is created (for the driver where SparkContext lives) using PluginContainer.apply utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  PluginContainer is then requested to registerMetrics with the applicationId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  PluginContainer is requested to shutdown when SparkContext is requested to stop.
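
For context, the plugins that end up in the PluginContainer are declared with the spark.plugins configuration property; a hedged sketch follows (the class name is made up and must implement org.apache.spark.api.plugin.SparkPlugin):

import org.apache.spark.SparkConf

// The listed classes are loaded into the PluginContainer created with SparkContext.
val conf = new SparkConf()
  .set("spark.plugins", "org.example.MyMonitoringPlugin")  // made-up plugin class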

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#creating-schedulerbackend-and-taskscheduler","title":"Creating SchedulerBackend and TaskScheduler
createTaskScheduler(
  sc: SparkContext,
  master: String,
  deployMode: String): (SchedulerBackend, TaskScheduler)

                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskScheduler creates a SchedulerBackend and a TaskScheduler for the given master URL and deployment mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, createTaskScheduler branches off per the given master URL to select the requested implementations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskScheduler accepts the following master URLs:

• local - local mode with 1 thread only
• local[n] or local[*] - local mode with n threads (or as many threads as CPU cores for *)
• local[n, m] or local[*, m] - local mode with n threads and up to m task failures
• spark://hostname:port - Spark Standalone
• local-cluster[n, m, z] - local cluster with n workers, m cores per worker, and z MB of memory per worker
• Other URLs are handed over to getClusterManager to load an external cluster manager, if available

                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskScheduler is used when SparkContext is created.
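
A few hedged examples of how the master URLs above are typically set on a SparkConf (the host name and sizes are made up):

import org.apache.spark.SparkConf

// Each master URL makes createTaskScheduler pick a different
// SchedulerBackend / TaskScheduler pair.
val localMaster  = new SparkConf().setMaster("local[4]")                  // local mode, 4 threads
val withRetries  = new SparkConf().setMaster("local[4, 3]")               // 4 threads, up to 3 task failures
val localCluster = new SparkConf().setMaster("local-cluster[2, 2, 1024]") // 2 workers, 2 cores, 1024 MB each
val standalone   = new SparkConf().setMaster("spark://master:7077")       // Spark Standalone (made-up host)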

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#loading-externalclustermanager","title":"Loading ExternalClusterManager
getClusterManager(
  url: String): Option[ExternalClusterManager]

                                                                                                                                                                                                                                                                                                                                                                                                                                                  getClusterManager uses Java's ServiceLoader to find and load an ExternalClusterManager that supports the given master URL.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalClusterManager Service Discovery

                                                                                                                                                                                                                                                                                                                                                                                                                                                  For ServiceLoader to find ExternalClusterManagers, they have to be registered using the following file:

META-INF/services/org.apache.spark.scheduler.ExternalClusterManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                  getClusterManager throws a SparkException when multiple cluster managers were found:

Multiple external cluster managers registered for the url [url]: [serviceLoaders]

getClusterManager is used when SparkContext is requested for a SchedulerBackend and TaskScheduler.
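
A rough, simplified sketch of the lookup described above (findClusterManager is a made-up name; the real code handles class loading and errors differently, and the collection converters assume Scala 2.13):

import java.util.ServiceLoader
import scala.jdk.CollectionConverters._
import org.apache.spark.SparkException
import org.apache.spark.scheduler.ExternalClusterManager

// Load every ExternalClusterManager registered in
// META-INF/services/org.apache.spark.scheduler.ExternalClusterManager
// and keep the ones that accept the given master URL.
def findClusterManager(url: String): Option[ExternalClusterManager] = {
  val loader = Thread.currentThread().getContextClassLoader
  val candidates = ServiceLoader.load(classOf[ExternalClusterManager], loader)
    .asScala
    .filter(_.canCreate(url))
    .toList
  if (candidates.size > 1) {
    throw new SparkException(
      s"Multiple external cluster managers registered for the url $url: $candidates")
  }
  candidates.headOption
}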

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#runJob","title":"Running Job (Synchronously)
runJob[T, U: ClassTag](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U): Array[U]
runJob[T, U: ClassTag](
  rdd: RDD[T],
  processPartition: (TaskContext, Iterator[T]) => U,
  resultHandler: (Int, U) => Unit): Unit
runJob[T, U: ClassTag](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int]): Array[U]
runJob[T, U: ClassTag]( // (1)!
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  resultHandler: (Int, U) => Unit): Unit
runJob[T, U: ClassTag](
  rdd: RDD[T],
  func: Iterator[T] => U): Array[U]
runJob[T, U: ClassTag](
  rdd: RDD[T],
  processPartition: Iterator[T] => U,
  resultHandler: (Int, U) => Unit): Unit
runJob[T, U: ClassTag](
  rdd: RDD[T],
  func: Iterator[T] => U,
  partitions: Seq[Int]): Array[U]
                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. Requests the DAGScheduler to run a job

                                                                                                                                                                                                                                                                                                                                                                                                                                                  runJob determines the call site and cleans up the given func function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  runJob prints out the following INFO message to the logs:

Starting job: [callSite]

                                                                                                                                                                                                                                                                                                                                                                                                                                                  With spark.logLineage enabled, runJob requests the given RDD for the recursive dependencies and prints out the following INFO message to the logs:

RDD's recursive dependencies:
[toDebugString]

                                                                                                                                                                                                                                                                                                                                                                                                                                                  runJob requests the DAGScheduler to run a job with the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The given rdd
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The given func cleaned up
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The given partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The call site
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The given resultHandler function (procedure)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • The local properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

runJob blocks until the job has finished (regardless of the result, successful or not).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  runJob requests the ConsoleProgressBar (if available) to finishAll.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, runJob requests the given RDD to doCheckpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#runJob-demo","title":"Demo

runJob essentially executes a func function on all or a subset of the partitions of an RDD and returns the results as an array (one element per partition).

// Name the job's call site (shown in the logs and the web UI) via the callSite.short local property.
sc.setLocalProperty("callSite.short", "runJob Demo")

val partitionsNumber = 4
val rdd = sc.parallelize(
  Seq("hello world", "nice to see you"),
  numSlices = partitionsNumber)

import org.apache.spark.TaskContext
// The partition function simply returns 1, so runJob yields one element per partition.
val func = (t: TaskContext, ss: Iterator[String]) => 1
val result = sc.runJob(rdd, func)
assert(result.length == partitionsNumber)

sc.clearCallSite()
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#call-site","title":"Call Site
getCallSite(): CallSite

                                                                                                                                                                                                                                                                                                                                                                                                                                                  getCallSite...FIXME

getCallSite is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to broadcast, runJob, runApproximateJob, submitJob and submitMapStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AsyncRDDActions is requested to takeAsync
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDD is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#closure-cleaning","title":"Closure Cleaning
clean(
  f: F,
  checkSerializable: Boolean = true): F

                                                                                                                                                                                                                                                                                                                                                                                                                                                  clean cleans up the given f closure (using ClosureCleaner.clean utility).
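
To illustrate what cleaning protects against, a hedged sketch (Multiplier and its fields are made up): the closure below needs only the factor value, yet, being defined inside an instance method, it could otherwise pull the whole non-serializable enclosing instance into the serialized task.

import org.apache.spark.rdd.RDD

class Multiplier(factor: Int) {          // made-up class; not Serializable
  private val heavyState = new Object    // non-serializable state the tasks never need

  def multiply(rdd: RDD[Int]): RDD[Int] = {
    val f = factor                       // capture only the value the tasks need
    // RDD operators such as map pass their closures through SparkContext.clean
    // before shipping them to executors, stripping unneeded outer references and
    // failing fast on the driver if the closure is still not serializable.
    rdd.map(_ * f)
  }
}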

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable DEBUG logging level for org.apache.spark.util.ClosureCleaner logger to see what happens inside the class.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.util.ClosureCleaner=DEBUG

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  With DEBUG logging level you should see the following messages in the logs:

+++ Cleaning closure [func] ([func.getClass.getName]) +++
 + declared fields: [declaredFields.size]
     [field]
 ...
+++ closure [func] ([func.getClass.getName]) is now cleaned +++
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks
maxNumConcurrentTasks(
  rp: ResourceProfile): Int

                                                                                                                                                                                                                                                                                                                                                                                                                                                  maxNumConcurrentTasks requests the SchedulerBackend for the maximum number of tasks that can be launched concurrently (with the given ResourceProfile).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  maxNumConcurrentTasks is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DAGScheduler is requested to checkBarrierStageWithNumSlots
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#withScope","title":"withScope
withScope[U](
  body: => U): U

withScope executes the given body in a named RDD operation scope (using RDDOperationScope.withScope with this SparkContext).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                  withScope is used for most (if not all) SparkContext API operators.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#getPreferredLocs","title":"Finding Preferred Locations for RDD Partition
getPreferredLocs(
  rdd: RDD[_],
  partition: Int): Seq[TaskLocation]

                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPreferredLocs requests the DAGScheduler for the preferred locations of the given partition (of the given RDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

Preferred locations of an RDD partition are also referred to as placement preferences or locality preferences.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPreferredLocs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • CoalescedRDDPartition is requested to localFraction
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DefaultPartitionCoalescer is requested to currPrefLocs
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • PartitionerAwareUnionRDD is requested to currPrefLocs
                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkContext/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.SparkContext logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j2.properties:

logger.SparkContext.name = org.apache.spark.SparkContext
logger.SparkContext.level = all

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"SparkCoreErrors/","title":"SparkCoreErrors","text":""},{"location":"SparkCoreErrors/#numPartitionsGreaterThanMaxNumConcurrentTasksError","title":"numPartitionsGreaterThanMaxNumConcurrentTasksError","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitionsGreaterThanMaxNumConcurrentTasksError(\n  numPartitions: Int,\n  maxNumConcurrentTasks: Int): Throwable\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitionsGreaterThanMaxNumConcurrentTasksError creates a BarrierJobSlotsNumberCheckFailed with the given input arguments.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitionsGreaterThanMaxNumConcurrentTasksError is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DAGScheduler is requested to checkBarrierStageWithNumSlots
                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"SparkEnv/","title":"SparkEnv","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkEnv is a handle to Spark Execution Environment with the core services of Apache Spark (that interact with each other to establish a distributed computing platform for a Spark application).

There are two separate SparkEnvs: one for the driver and one for each executor.
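
From a Spark shell you can peek at the SparkEnv of the current process and a few of its services. A minimal sketch (variable names are arbitrary; what you get back depends on your deployment):

import org.apache.spark.SparkEnv

// SparkEnv of the current process (the driver, in a spark-shell)
val env = SparkEnv.get

// A few of the core services it stitches together
val conf         = env.conf           // SparkConf
val serializer   = env.serializer     // Serializer
val blockManager = env.blockManager   // BlockManager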

                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","tags":["DeveloperApi"]},{"location":"SparkEnv/#core-services","title":"Core Services Property Service blockManager BlockManager broadcastManager BroadcastManager closureSerializer Serializer conf SparkConf mapOutputTracker MapOutputTracker memoryManager MemoryManager metricsSystem MetricsSystem outputCommitCoordinator OutputCommitCoordinator rpcEnv RpcEnv securityManager SecurityManager serializer Serializer serializerManager SerializerManager shuffleManager ShuffleManager","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkEnv takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Serializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Serializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MapOutputTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BroadcastManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MetricsSystem
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • OutputCommitCoordinator
                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
SparkEnv is created using the create utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#drivers-temporary-directory","title":"Driver's Temporary Directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                    driverTmpDir: Option[String]\n

SparkEnv defines the driverTmpDir internal registry for the driver that is used as the root directory of files added using SparkContext.addFile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    driverTmpDir is undefined initially and is defined for the driver only when SparkEnv utility is used to create a \"base\" SparkEnv.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#demo","title":"Demo
                                                                                                                                                                                                                                                                                                                                                                                                                                                    import org.apache.spark.SparkEnv\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                    // :pa -raw\n// BEGIN\npackage org.apache.spark\nobject BypassPrivateSpark {\n  def driverTmpDir(sparkEnv: SparkEnv) = {\n    sparkEnv.driverTmpDir\n  }\n}\n// END\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                    val driverTmpDir = org.apache.spark.BypassPrivateSpark.driverTmpDir(SparkEnv.get).get\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    The above is equivalent to the following snippet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    import org.apache.spark.SparkFiles\nSparkFiles.getRootDirectory\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-sparkenv-for-driver","title":"Creating SparkEnv for Driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                    createDriverEnv(\n  conf: SparkConf,\n  isLocal: Boolean,\n  listenerBus: LiveListenerBus,\n  numCores: Int,\n  mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    createDriverEnv creates a SparkEnv execution environment for the driver.

createDriverEnv accepts a SparkConf, whether it runs in local mode or not, a LiveListenerBus, the number of cores to use for execution in local mode or 0 otherwise, and an optional OutputCommitCoordinator (default: none).

createDriverEnv ensures that the spark.driver.host and spark.driver.port settings are defined.

It then passes the call straight on to create (the \"base\" SparkEnv) with the driver executor ID and the other input parameters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    createDriverEnv is used when SparkContext is created.
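
For illustration only, these are the two settings in question (the values below are made up; SparkContext normally derives them from the driver's network configuration before createDriverEnv runs):

import org.apache.spark.SparkConf

// Hypothetical values -- in practice SparkContext sets these
// from the driver's actual host and (ephemeral) port
val conf = new SparkConf()
  .set("spark.driver.host", "10.0.0.5")
  .set("spark.driver.port", "35000")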

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-sparkenv-for-executor","title":"Creating SparkEnv for Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                    createExecutorEnv(\n  conf: SparkConf,\n  executorId: String,\n  hostname: String,\n  numCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  isLocal: Boolean): SparkEnv\ncreateExecutorEnv(\n  conf: SparkConf,\n  executorId: String,\n  bindAddress: String,\n  hostname: String,\n  numCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  isLocal: Boolean): SparkEnv\n

createExecutorEnv creates the SparkEnv (execution environment) for an executor.

createExecutorEnv simply creates the \"base\" SparkEnv (passing in all the input parameters) and sets it as the current SparkEnv.

NOTE: The number of cores (numCores) is configured using the --cores command-line option of CoarseGrainedExecutorBackend and is specific to a cluster manager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    createExecutorEnv is used when CoarseGrainedExecutorBackend utility is requested to run.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-base-sparkenv","title":"Creating \"Base\" SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                    create(\n  conf: SparkConf,\n  executorId: String,\n  bindAddress: String,\n  advertiseAddress: String,\n  port: Option[Int],\n  isLocal: Boolean,\n  numUsableCores: Int,\n  ioEncryptionKey: Option[Array[Byte]],\n  listenerBus: LiveListenerBus = null,\n  mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates the \"base\" SparkEnv (that is common across the driver and executors).

create creates an RpcEnv named sparkDriver (on the driver) or sparkExecutor (on executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a Serializer (based on spark.serializer configuration property). create prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Using serializer: [serializer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a SerializerManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a JavaSerializer as the closure serializer.

create creates a BroadcastManager.

create creates a MapOutputTrackerMaster (on the driver) or a MapOutputTrackerWorker (on executors). create registers or looks up a MapOutputTrackerMasterEndpoint under the name of MapOutputTracker. create prints out the following INFO message to the logs (on the driver only):

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Registering MapOutputTracker\n

create creates a ShuffleManager (based on the spark.shuffle.manager configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a UnifiedMemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    With spark.shuffle.service.enabled configuration property enabled, create creates an ExternalBlockStoreClient.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a BlockManagerMaster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a NettyBlockTransferService.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a MetricsSystem.

create creates an OutputCommitCoordinator and registers or looks up an OutputCommitCoordinatorEndpoint under the name of OutputCommitCoordinator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    create creates a SparkEnv (with all the services \"stitched\" together).
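
Several of the steps above are driven by configuration properties. A minimal sketch of the ones mentioned (the values shown are defaults or common choices, purely for illustration):

import org.apache.spark.SparkConf

// Properties consulted while the base SparkEnv is assembled
// (illustrative values; every property has a sensible default)
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Serializer
  .set("spark.shuffle.manager", "sort")                                  // ShuffleManager
  .set("spark.shuffle.service.enabled", "false")                         // ExternalBlockStoreClient or not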

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.SparkEnv logger to see what happens inside.

Add the following line to conf/log4j2.properties:

logger.SparkEnv.name = org.apache.spark.SparkEnv\nlogger.SparkEnv.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"SparkFiles/","title":"SparkFiles","text":"

SparkFiles is a utility to work with files added using SparkContext.addFile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkFiles/#absolute-path-of-added-file","title":"Absolute Path of Added File
                                                                                                                                                                                                                                                                                                                                                                                                                                                    get(\n  filename: String): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    get gets the absolute path of the given file in the root directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkFiles/#root-directory","title":"Root Directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                    getRootDirectory(): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    getRootDirectory requests the current SparkEnv for driverTmpDir (if defined) or defaults to the current directory (.).

getRootDirectory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkContext is requested to addFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Executor is requested to updateDependencies
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkFiles utility is requested to get the absolute path of a file
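
A quick sketch of the two methods together (the file name is made up; assumes an active SparkContext sc, e.g. in a spark-shell):

import org.apache.spark.SparkFiles

sc.addFile("/tmp/lookup.csv")     // hypothetical local file

SparkFiles.getRootDirectory()     // root directory for added files
SparkFiles.get("lookup.csv")      // absolute path of the added file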
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/","title":"SparkHadoopWriter Utility","text":""},{"location":"SparkHadoopWriter/#writing-key-value-rdd-out-as-hadoop-outputformat","title":"Writing Key-Value RDD Out (As Hadoop OutputFormat)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    write[K, V: ClassTag](\n  rdd: RDD[(K, V)],\n  config: HadoopWriteConfigUtil[K, V]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write runs a Spark job to write out partition records (for all partitions of the given key-value RDD) with the given HadoopWriteConfigUtil and a HadoopMapReduceCommitProtocol committer.

The number of writer tasks (parallelism) is the number of partitions in the given key-value RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/#internals","title":"Internals

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Internally, write uses the id of the given RDD as the commitJobId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write creates a jobTrackerId with the current date.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the given HadoopWriteConfigUtil to create a Hadoop JobContext (for the jobTrackerId and commitJobId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the given HadoopWriteConfigUtil to initOutputFormat with the Hadoop JobContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the given HadoopWriteConfigUtil to assertConf.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the given HadoopWriteConfigUtil to create a HadoopMapReduceCommitProtocol committer for the commitJobId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the HadoopMapReduceCommitProtocol to setupJob (with the jobContext).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write uses the SparkContext (of the given RDD) to run a Spark job asynchronously for the given RDD with the executeTask partition function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, write requests the HadoopMapReduceCommitProtocol to commit the job and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Job [getJobID] committed.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/#throwables","title":"Throwables

                                                                                                                                                                                                                                                                                                                                                                                                                                                    In case of any Throwable, write prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Aborting job [getJobID].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the HadoopMapReduceCommitProtocol to abort the job and throws a SparkException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Job aborted.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/#usage","title":"Usage

write is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • PairRDDFunctions.saveAsNewAPIHadoopDataset
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • PairRDDFunctions.saveAsHadoopDataset
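
As an illustration of the first entry point, the following hedged example saves a pair RDD with PairRDDFunctions.saveAsNewAPIHadoopDataset. It assumes an active SparkContext sc (e.g. in spark-shell); the output path /tmp/spark-hadoop-writer-demo is arbitrary.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

val pairs = sc
  .parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(s"$k,$v"), NullWritable.get()) }

val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[NullWritable])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, NullWritable]])
FileOutputFormat.setOutputPath(job, new Path("/tmp/spark-hadoop-writer-demo"))

// Triggers SparkHadoopWriter.write under the covers
pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)
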
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/#writing-rdd-partition","title":"Writing RDD Partition
                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask[K, V: ClassTag](\n  context: TaskContext,\n  config: HadoopWriteConfigUtil[K, V],\n  jobTrackerId: String,\n  commitJobId: Int,\n  sparkPartitionId: Int,\n  sparkAttemptNumber: Int,\n  committer: FileCommitProtocol,\n  iterator: Iterator[(K, V)]): TaskCommitMessage\n


                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask requests the given HadoopWriteConfigUtil to create a TaskAttemptContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask requests the given FileCommitProtocol to set up a task with the TaskAttemptContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask requests the given HadoopWriteConfigUtil to initWriter (with the TaskAttemptContext and the given sparkPartitionId).

executeTask initializes the Hadoop output metrics (initHadoopOutputMetrics).

executeTask then writes all rows of the RDD partition (from the given Iterator[(K, V)]) by requesting the given HadoopWriteConfigUtil to write every record. In the end, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to commit the task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask updates metrics about writing data to external systems (bytesWritten and recordsWritten) every few records and at the end.

In case of any error, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to abort the task, and then prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Task [taskAttemptID] aborted.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    executeTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkHadoopWriter utility is used to write
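
A minimal, self-contained sketch of the per-task flow described above follows. Writer and Committer are hypothetical stand-ins for HadoopWriteConfigUtil and FileCommitProtocol; this is not Spark's actual code.

// Hypothetical stand-ins for HadoopWriteConfigUtil and FileCommitProtocol
trait Writer[K, V] { def write(key: K, value: V): Unit; def close(): Unit }
trait Committer { def setupTask(): Unit; def commitTask(): Unit; def abortTask(): Unit }

def executeTaskSketch[K, V](
    records: Iterator[(K, V)],
    writer: Writer[K, V],
    committer: Committer): Unit = {
  committer.setupTask()              // FileCommitProtocol.setupTask
  try {
    var recordsWritten = 0L
    records.foreach { case (k, v) =>
      writer.write(k, v)             // HadoopWriteConfigUtil.write
      recordsWritten += 1            // output metrics are updated every few records
    }
    writer.close()                   // HadoopWriteConfigUtil.closeWriter
    committer.commitTask()           // FileCommitProtocol.commitTask
  } catch {
    case t: Throwable =>
      writer.close()                 // close the writer before aborting
      committer.abortTask()          // FileCommitProtocol.abortTask
      throw t
  }
}
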
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkHadoopWriter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.internal.io.SparkHadoopWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.internal.io.SparkHadoopWriter=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListener/","title":"SparkListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListener\u00a0is an extension of the SparkListenerInterface abstraction for event listeners with a no-op implementation for callback methods.
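
A hedged example of a custom SparkListener that overrides just two callbacks and relies on the no-op defaults for the rest (JobTimingListener is an illustrative name):

import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobTimingListener extends SparkListener {
  private val startedAt = TrieMap.empty[Int, Long]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    startedAt.put(jobStart.jobId, jobStart.time)

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    startedAt.remove(jobEnd.jobId).foreach { started =>
      println(s"Job ${jobEnd.jobId} finished in ${jobEnd.time - started} ms")
    }
}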

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"SparkListener/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BarrierCoordinator
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkSession (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • AppListingListener (Spark History Server)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • AppStatusListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BasicEventFilterBuilder (Spark History Server)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • EventLoggingListener (Spark History Server)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutionListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorAllocationListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorMonitor
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • HeartbeatReceiver
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • HiveThriftServer2Listener (Spark Thrift Server)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SpillListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SQLAppStatusListener (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SQLEventFilterBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • StatsReportListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • StreamingQueryListenerBus (Spark Structured Streaming)
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"SparkListenerBus/","title":"SparkListenerBus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerBus\u00a0is an extension of the ListenerBus abstraction for event buses for SparkListenerInterfaces to be notified about SparkListenerEvents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerBus/#posting-event-to-sparklistener","title":"Posting Event to SparkListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPostEvent(\n  listener: SparkListenerInterface,\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPostEvent\u00a0is part of the ListenerBus abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPostEvent notifies the given SparkListenerInterface about the SparkListenerEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPostEvent calls an event-specific method of SparkListenerInterface or falls back to onOtherEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerBus/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • AsyncEventQueue
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ReplayListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerEvent/","title":"SparkListenerEvent","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerEvent is an abstraction of scheduling events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerEvent/#dispatching-sparklistenerevents","title":"Dispatching SparkListenerEvents","text":"

SparkListenerBus in general (and AsyncEventQueue in particular) are event buses used to dispatch SparkListenerEvents to registered SparkListeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    LiveListenerBus is an event bus to dispatch SparkListenerEvents to registered SparkListeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerEvent/#spark-history-server","title":"Spark History Server","text":"

Once logged, Spark History Server uses the JsonProtocol utility to deserialize events back from JSON (sparkEventFromJson).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerEvent/#contract","title":"Contract","text":""},{"location":"SparkListenerEvent/#logevent","title":"logEvent
                                                                                                                                                                                                                                                                                                                                                                                                                                                    logEvent: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    logEvent controls whether EventLoggingListener should save the event to an event log.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                    logEvent\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • EventLoggingListener is requested to handle \"other\" events
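
A hedged sketch of a custom event type and a listener that handles it through onOtherEvent. MyCustomEvent is an illustrative name; logEvent keeps its default of true, so EventLoggingListener would persist the event.

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

case class MyCustomEvent(payload: String) extends SparkListenerEvent

class MyCustomEventListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case MyCustomEvent(payload) => println(s"Received: $payload")
    case _ => // not interested in other events
  }
}
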
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerEvent/#implementations","title":"Implementations","text":""},{"location":"SparkListenerEvent/#sparklistenerapplicationend","title":"SparkListenerApplicationEnd","text":""},{"location":"SparkListenerEvent/#sparklistenerapplicationstart","title":"SparkListenerApplicationStart","text":""},{"location":"SparkListenerEvent/#sparklistenerblockmanageradded","title":"SparkListenerBlockManagerAdded","text":""},{"location":"SparkListenerEvent/#sparklistenerblockmanagerremoved","title":"SparkListenerBlockManagerRemoved","text":""},{"location":"SparkListenerEvent/#sparklistenerblockupdated","title":"SparkListenerBlockUpdated","text":""},{"location":"SparkListenerEvent/#sparklistenerenvironmentupdate","title":"SparkListenerEnvironmentUpdate","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutoradded","title":"SparkListenerExecutorAdded","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorblacklisted","title":"SparkListenerExecutorBlacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorblacklistedforstage","title":"SparkListenerExecutorBlacklistedForStage","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutormetricsupdate","title":"SparkListenerExecutorMetricsUpdate","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorremoved","title":"SparkListenerExecutorRemoved","text":""},{"location":"SparkListenerEvent/#sparklistenerexecutorunblacklisted","title":"SparkListenerExecutorUnblacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerjobend","title":"SparkListenerJobEnd","text":""},{"location":"SparkListenerEvent/#sparklistenerjobstart","title":"SparkListenerJobStart","text":""},{"location":"SparkListenerEvent/#sparklistenerlogstart","title":"SparkListenerLogStart","text":""},{"location":"SparkListenerEvent/#sparklistenernodeblacklisted","title":"SparkListenerNodeBlacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenernodeblacklistedforstage","title":"SparkListenerNodeBlacklistedForStage","text":""},{"location":"SparkListenerEvent/#sparklistenernodeunblacklisted","title":"SparkListenerNodeUnblacklisted","text":""},{"location":"SparkListenerEvent/#sparklistenerspeculativetasksubmitted","title":"SparkListenerSpeculativeTaskSubmitted","text":""},{"location":"SparkListenerEvent/#sparklistenerstagecompleted","title":"SparkListenerStageCompleted","text":""},{"location":"SparkListenerEvent/#sparklistenerstageexecutormetrics","title":"SparkListenerStageExecutorMetrics","text":""},{"location":"SparkListenerEvent/#sparklistenerstagesubmitted","title":"SparkListenerStageSubmitted","text":""},{"location":"SparkListenerEvent/#sparklistenertaskend","title":"SparkListenerTaskEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerTaskEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerEvent/#sparklistenertaskgettingresult","title":"SparkListenerTaskGettingResult","text":""},{"location":"SparkListenerEvent/#sparklistenertaskstart","title":"SparkListenerTaskStart","text":""},{"location":"SparkListenerEvent/#sparklistenerunpersistrdd","title":"SparkListenerUnpersistRDD","text":""},{"location":"SparkListenerInterface/","title":"SparkListenerInterface","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerInterface is an abstraction of event listeners (that SparkListenerBus notifies about scheduling events).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerInterface is a way to intercept scheduling events from the Spark Scheduler that are emitted over the course of execution of a Spark application.

SparkListenerInterface is used heavily by Spark's internal components to exchange information about a running Spark application (e.g. the web UI, event persistence for the History Server, dynamic allocation of executors, keeping track of executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerInterface can be registered in a Spark application using SparkContext.addSparkListener method or spark.extraListeners configuration property.
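
For example (a hedged sketch that assumes an active SparkContext sc and a custom listener class com.example.MyAppListener):

// 1) Programmatic registration
sc.addSparkListener(new com.example.MyAppListener)

// 2) Declarative registration with spark.extraListeners
//    (the listener class needs a zero-argument constructor), e.g. with spark-submit:
//    --conf spark.extraListeners=com.example.MyAppListener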

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Tip

Enable INFO logging level for the org.apache.spark.SparkContext logger to see which custom Spark listeners are registered and when.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerInterface/#onapplicationend","title":"onApplicationEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onApplicationEnd(\n  applicationEnd: SparkListenerApplicationEnd): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerApplicationEnd event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onapplicationstart","title":"onApplicationStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onApplicationStart(\n  applicationStart: SparkListenerApplicationStart): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerApplicationStart event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onblockmanageradded","title":"onBlockManagerAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onBlockManagerAdded(\n  blockManagerAdded: SparkListenerBlockManagerAdded): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerBlockManagerAdded event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onblockmanagerremoved","title":"onBlockManagerRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onBlockManagerRemoved(\n  blockManagerRemoved: SparkListenerBlockManagerRemoved): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerBlockManagerRemoved event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onblockupdated","title":"onBlockUpdated
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onBlockUpdated(\n  blockUpdated: SparkListenerBlockUpdated): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerBlockUpdated event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onenvironmentupdate","title":"onEnvironmentUpdate
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onEnvironmentUpdate(\n  environmentUpdate: SparkListenerEnvironmentUpdate): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerEnvironmentUpdate event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutoradded","title":"onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorAdded(\n  executorAdded: SparkListenerExecutorAdded): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorAdded event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutorblacklisted","title":"onExecutorBlacklisted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorBlacklisted(\n  executorBlacklisted: SparkListenerExecutorBlacklisted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorBlacklisted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutorblacklistedforstage","title":"onExecutorBlacklistedForStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorBlacklistedForStage(\n  executorBlacklistedForStage: SparkListenerExecutorBlacklistedForStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorBlacklistedForStage event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutormetricsupdate","title":"onExecutorMetricsUpdate
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorMetricsUpdate(\n  executorMetricsUpdate: SparkListenerExecutorMetricsUpdate): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorMetricsUpdate event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutorremoved","title":"onExecutorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorRemoved(\n  executorRemoved: SparkListenerExecutorRemoved): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorRemoved event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onexecutorunblacklisted","title":"onExecutorUnblacklisted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onExecutorUnblacklisted(\n  executorUnblacklisted: SparkListenerExecutorUnblacklisted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerExecutorUnblacklisted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onjobend","title":"onJobEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onJobEnd(\n  jobEnd: SparkListenerJobEnd): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerJobEnd event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onjobstart","title":"onJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onJobStart(\n  jobStart: SparkListenerJobStart): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerJobStart event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onnodeblacklisted","title":"onNodeBlacklisted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onNodeBlacklisted(\n  nodeBlacklisted: SparkListenerNodeBlacklisted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerNodeBlacklisted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onnodeblacklistedforstage","title":"onNodeBlacklistedForStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onNodeBlacklistedForStage(\n  nodeBlacklistedForStage: SparkListenerNodeBlacklistedForStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerNodeBlacklistedForStage event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onnodeunblacklisted","title":"onNodeUnblacklisted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onNodeUnblacklisted(\n  nodeUnblacklisted: SparkListenerNodeUnblacklisted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerNodeUnblacklisted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onotherevent","title":"onOtherEvent
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onOtherEvent(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a custom SparkListenerEvent
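For illustration only, a minimal sketch of how a custom event can flow through onOtherEvent. The MyAppEvent and MyAppEventListener names are made up for this example and are not part of Spark.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

    // A hypothetical application-specific event (MyAppEvent is a made-up name for illustration).
    case class MyAppEvent(message: String) extends SparkListenerEvent

    // A minimal sketch of a listener that reacts to custom events delivered through onOtherEvent.
    class MyAppEventListener extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case MyAppEvent(message) => println(s"Received custom event: $message")
        case _                   => // ignore other custom events
      }
    }

    // Registration (assuming an active SparkContext `sc`):
    //   sc.addSparkListener(new MyAppEventListener)
    // The event itself is posted on the (internal) LiveListenerBus, which delivers it to onOtherEvent.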
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onspeculativetasksubmitted","title":"onSpeculativeTaskSubmitted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onSpeculativeTaskSubmitted(\n  speculativeTask: SparkListenerSpeculativeTaskSubmitted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerSpeculativeTaskSubmitted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onstagecompleted","title":"onStageCompleted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onStageCompleted(\n  stageCompleted: SparkListenerStageCompleted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerStageCompleted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onstageexecutormetrics","title":"onStageExecutorMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onStageExecutorMetrics(\n  executorMetrics: SparkListenerStageExecutorMetrics): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerStageExecutorMetrics event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onstagesubmitted","title":"onStageSubmitted
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onStageSubmitted(\n  stageSubmitted: SparkListenerStageSubmitted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerStageSubmitted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#ontaskend","title":"onTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onTaskEnd(\n  taskEnd: SparkListenerTaskEnd): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerTaskEnd event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#ontaskgettingresult","title":"onTaskGettingResult
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onTaskGettingResult(\n  taskGettingResult: SparkListenerTaskGettingResult): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerTaskGettingResult event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#ontaskstart","title":"onTaskStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onTaskStart(\n  taskStart: SparkListenerTaskStart): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerTaskStart event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#onunpersistrdd","title":"onUnpersistRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                    onUnpersistRDD(\n  unpersistRDD: SparkListenerUnpersistRDD): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListenerBus is requested to post a SparkListenerUnpersistRDD event
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerInterface/#implementations","title":"Implementations
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • EventFilterBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkFirehoseListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkListener
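As a sketch of how the callbacks above are typically used, the following made-up JobTimingListener extends the no-op SparkListener adapter (rather than implementing SparkListenerInterface directly) and overrides a few callbacks. The class name and timing logic are illustrative, not Spark code.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerTaskEnd}

    // A minimal sketch of a custom listener that tracks job durations and task completions.
    class JobTimingListener extends SparkListener {
      private val jobStartTimes = scala.collection.concurrent.TrieMap.empty[Int, Long]

      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobStartTimes(jobStart.jobId) = jobStart.time

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        jobStartTimes.remove(jobEnd.jobId).foreach { started =>
          println(s"Job ${jobEnd.jobId} took ${jobEnd.time - started} ms")
        }

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"Task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} finished")
    }

    // Registration (assuming an active SparkContext `sc`):
    //   sc.addSparkListener(new JobTimingListener)
    // or declaratively when submitting: --conf spark.extraListeners=com.example.JobTimingListener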
                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"SparkListenerTaskEnd/","title":"SparkListenerTaskEnd","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerTaskEnd is a SparkListenerEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerTaskEnd is posted (and created) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DAGScheduler is requested to postTaskEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerTaskEnd is intercepted using SparkListenerInterface.onTaskEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"SparkListenerTaskEnd/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                    SparkListenerTaskEnd takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Task Type
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskEndReason
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskMetrics"},{"location":"SparkStatusTracker/","title":"SparkStatusTracker","text":"

SparkStatusTracker is created by SparkContext to give Spark developers access to the AppStatusStore and the following (see the usage sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • All active job IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • All active stage IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • All known job IDs (and possibly limited to a particular job group)
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkExecutorInfos of all known executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkJobInfo of a job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkStageInfo of a stage ID
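A usage sketch of the accessors listed above, assuming an active SparkContext `sc`:

    // SparkStatusTracker is available as sc.statusTracker
    val tracker = sc.statusTracker

    val activeJobs: Array[Int] = tracker.getActiveJobIds()
    val activeStages: Array[Int] = tracker.getActiveStageIds()

    activeJobs.foreach { jobId =>
      tracker.getJobInfo(jobId).foreach { job =>
        println(s"Job $jobId is ${job.status} with ${job.stageIds.length} stage(s)")
      }
    }

    tracker.getExecutorInfos.foreach { exec =>
      println(s"Executor at ${exec.host}:${exec.port} runs ${exec.numRunningTasks()} task(s)")
    }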
                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"SparkStatusTracker/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                      SparkStatusTracker takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkContext (unused)
                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AppStatusStore

SparkStatusTracker is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"SpillListener/","title":"SpillListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SpillListener is a SparkListener that intercepts (listens to) the following events for detecting spills in jobs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • onTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • onStageCompleted

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SpillListener is used for testing only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"SpillListener/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SpillListener takes no input arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SpillListener is created when TestUtils is requested to assertSpilled and assertNotSpilled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"SpillListener/#ontaskend-callback","title":"onTaskEnd Callback
                                                                                                                                                                                                                                                                                                                                                                                                                                                        onTaskEnd(\n  taskEnd: SparkListenerTaskEnd): Unit\n

onTaskEnd records the TaskMetrics of the completed task under the task's stage ID (so that onStageCompleted can later check the whole stage for spills).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        onTaskEnd is part of the SparkListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"SpillListener/#onstagecompleted-callback","title":"onStageCompleted Callback
                                                                                                                                                                                                                                                                                                                                                                                                                                                        onStageCompleted(\n  stageComplete: SparkListenerStageCompleted): Unit\n

onStageCompleted sums up the memoryBytesSpilled metrics recorded for the tasks of the completed stage and, when the total is greater than 0, marks the stage as spilled (see the sketch below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        onStageCompleted is part of the SparkListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"StatsReportListener/","title":"StatsReportListener \u2014 Logging Summary Statistics","text":"

org.apache.spark.scheduler.StatsReportListener is a SparkListener that logs summary statistics when each stage completes (see the listener's scaladoc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.StatsReportListener).

StatsReportListener intercepts SparkListenerTaskEnd and SparkListenerStageCompleted events and prints out summary statistics at the INFO logging level.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"StatsReportListener/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable INFO logging level for org.apache.spark.scheduler.StatsReportListener logger to see Spark events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.scheduler.StatsReportListener=INFO\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"StatsReportListener/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[onStageCompleted]] Intercepting Stage Completed Events -- onStageCompleted Callback

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[example]] Example

                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ ./bin/spark-shell -c spark.extraListeners=org.apache.spark.scheduler.StatsReportListener\n...\nINFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener\n...\n\nscala> spark.read.text(\"README.md\").count\n...\nINFO StatsReportListener: Finished stage: Stage(0, 0); Name: 'count at <console>:24'; Status: succeeded; numTasks: 1; Took: 212 msec\nINFO StatsReportListener: task runtime:(count: 1, mean: 198.000000, stdev: 0.000000, max: 198.000000, min: 198.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms    198.0 ms\nINFO StatsReportListener: shuffle bytes written:(count: 1, mean: 59.000000, stdev: 0.000000, max: 59.000000, min: 59.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B  59.0 B\nINFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms\nINFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: task result size:(count: 1, mean: 1885.000000, stdev: 0.000000, max: 1885.000000, min: 1885.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B    1885.0 B\nINFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 73.737374, stdev: 0.000000, max: 73.737374, min: 73.737374)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   74 %    74 %    74 %    74 %    74 %    74 %    74 %    74 %    74 %\nINFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:    0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %\nINFO StatsReportListener: other time pct: (count: 1, mean: 26.262626, stdev: 0.000000, max: 26.262626, min: 26.262626)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   26 %    26 %    26 %    26 %    26 %    26 %    26 %    26 %    26 %\nINFO StatsReportListener: Finished stage: Stage(1, 0); Name: 'count at <console>:24'; Status: succeeded; numTasks: 1; Took: 34 msec\nINFO StatsReportListener: task runtime:(count: 1, mean: 33.000000, stdev: 0.000000, max: 33.000000, min: 33.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   33.0 ms 33.0 ms 33.0 
ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms 33.0 ms\nINFO StatsReportListener: shuffle bytes written:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms\nINFO StatsReportListener: remote bytes read:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B   0.0 B\nINFO StatsReportListener: task result size:(count: 1, mean: 1960.000000, stdev: 0.000000, max: 1960.000000, min: 1960.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B    1960.0 B\nINFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 75.757576, stdev: 0.000000, max: 75.757576, min: 75.757576)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   76 %    76 %    76 %    76 %    76 %    76 %    76 %    76 %    76 %\nINFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:    0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %\nINFO StatsReportListener: other time pct: (count: 1, mean: 24.242424, stdev: 0.000000, max: 24.242424, min: 24.242424)\nINFO StatsReportListener:   0%  5%  10% 25% 50% 75% 90% 95% 100%\nINFO StatsReportListener:   24 %    24 %    24 %    24 %    24 %    24 %    24 %    24 %    24 %\nres0: Long = 99\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/","title":"TaskCompletionListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskCompletionListener\u00a0is an extension of the EventListener (Java) abstraction for task listeners that can be notified on task completion.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"TaskCompletionListener/#ontaskcompletion","title":"onTaskCompletion
                                                                                                                                                                                                                                                                                                                                                                                                                                                        onTaskCompletion(\n  context: TaskContext): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskContextImpl is requested to addTaskCompletionListener (and a task has already completed) and markTaskCompleted
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleFetchCompletionListener is requested to onComplete
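
As an illustration only (not taken from the Spark sources), here is a minimal sketch of registering a task completion listener from task code; it assumes an existing rdd and that the closure runs on an executor where TaskContext.get() returns the active TaskContext:

import org.apache.spark.TaskContext\n\n// Hypothetical RDD-side usage: run cleanup code when a task completes\nrdd.mapPartitions { iter =>\n  val ctx = TaskContext.get()\n  ctx.addTaskCompletionListener[Unit] { _ =>\n    println(s\"Task ${ctx.taskAttemptId()} completed\")\n  }\n  iter\n}\n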
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"TaskFailureListener/","title":"TaskFailureListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskFailureListener\u00a0is an extension of the EventListener (Java) abstraction for task listeners that can be notified on task failure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"TaskFailureListener/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"TaskFailureListener/#ontaskfailure","title":"onTaskFailure
                                                                                                                                                                                                                                                                                                                                                                                                                                                        onTaskFailure(\n  context: TaskContext,\n  error: Throwable): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskContextImpl is requested to addTaskFailureListener (and a task has already failed) and markTaskFailed
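
A similarly hedged sketch of registering a task failure listener from inside a task (again assuming TaskContext.get() returns the active TaskContext):

import org.apache.spark.TaskContext\n\n// Hypothetical usage inside a task: get notified when the task fails\nval ctx = TaskContext.get()\nctx.addTaskFailureListener { (_, error) =>\n  println(s\"Task ${ctx.taskAttemptId()} failed: ${error.getMessage}\")\n}\n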
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"Utils/","title":"Utils Utility","text":""},{"location":"Utils/#getdynamicallocationinitialexecutors","title":"getDynamicAllocationInitialExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getDynamicAllocationInitialExecutors(\n  conf: SparkConf): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getDynamicAllocationInitialExecutors gives the maximum value of the following configuration properties (for the initial number of executors):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • spark.dynamicAllocation.initialExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • spark.dynamicAllocation.minExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • spark.executor.instances

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getDynamicAllocationInitialExecutors prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Using initial executors = [initialExecutors],\nmax of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        With spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid,\nignoring its setting, please update your configs.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        With spark.executor.instances less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid,\nignoring its setting, please update your configs.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getDynamicAllocationInitialExecutors is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SchedulerBackendUtils utility is used to getInitialTargetExecutorNumber
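
A hypothetical sketch of the max-of-three selection using SparkConf directly (not the actual Utils code; property names as listed above):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.dynamicAllocation.initialExecutors\", \"2\")\n  .set(\"spark.dynamicAllocation.minExecutors\", \"4\")\n  .set(\"spark.executor.instances\", \"3\")\n\n// max of the three properties (0 when unset) gives 4 here;\n// initialExecutors (2) < minExecutors (4) would also trigger the WARN message above\nval initialExecutors = Seq(\n  \"spark.dynamicAllocation.initialExecutors\",\n  \"spark.dynamicAllocation.minExecutors\",\n  \"spark.executor.instances\").map(conf.get(_, \"0\").toInt).max\n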
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#local-directories-for-scratch-space","title":"Local Directories for Scratch Space
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getConfiguredLocalDirs(\n  conf: SparkConf): Array[String]\n

getConfiguredLocalDirs returns the local directories where Spark can write files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getConfiguredLocalDirs uses the given SparkConf to know if External Shuffle Service is enabled or not (based on spark.shuffle.service.enabled configuration property).

When running in a YARN container (indicated by the CONTAINER_ID environment variable), getConfiguredLocalDirs uses the LOCAL_DIRS environment variable for YARN-approved local directories.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        In non-YARN mode (or for the driver in yarn-client mode), getConfiguredLocalDirs checks the following environment variables (in order) and returns the value of the first found:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. SPARK_EXECUTOR_DIRS
                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. SPARK_LOCAL_DIRS
                                                                                                                                                                                                                                                                                                                                                                                                                                                        3. MESOS_DIRECTORY (only when External Shuffle Service is not used)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The environment variables are a comma-separated list of local directory paths.

In the end, when none of the above environment variables are found, getConfiguredLocalDirs uses the spark.local.dir configuration property (with the java.io.tmpdir System property as the default value).
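
A simplified, hypothetical sketch of that non-YARN lookup order (the configuredLocalDirsSketch name is made up, and the External Shuffle Service check on MESOS_DIRECTORY is omitted):

import org.apache.spark.SparkConf\n\n// Hypothetical helper (not the actual Utils code)\ndef configuredLocalDirsSketch(conf: SparkConf): Array[String] = {\n  val fromEnv = Seq(\"SPARK_EXECUTOR_DIRS\", \"SPARK_LOCAL_DIRS\", \"MESOS_DIRECTORY\")\n    .flatMap(name => Option(System.getenv(name)))\n    .headOption\n  // fall back to spark.local.dir with java.io.tmpdir as the default\n  fromEnv\n    .getOrElse(conf.get(\"spark.local.dir\", System.getProperty(\"java.io.tmpdir\")))\n    .split(\",\")\n}\n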

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getConfiguredLocalDirs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is requested to createLocalDirs and createLocalDirsForMergedShuffleBlocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils utility is used to get a single random local root directory and create a spark directory in every local root directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#random-local-directory-path","title":"Random Local Directory Path
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocalDir(\n  conf: SparkConf): String\n

getLocalDir takes a random directory path out of the configured local root directories.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocalDir throws an IOException if no local directory is defined:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Failed to get a temp directory under [[configuredLocalDirs]].\n
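
A minimal sketch of the random pick (randomLocalDir is a hypothetical helper, not the real method body):

import java.io.IOException\nimport scala.util.Random\n\n// Hypothetical helper: pick one of the configured local root directories at random\ndef randomLocalDir(localDirs: Array[String]): String = {\n  if (localDirs.isEmpty) {\n    throw new IOException(s\"Failed to get a temp directory under ${localDirs.mkString(\",\")}.\")\n  }\n  localDirs(Random.nextInt(localDirs.length))\n}\n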

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocalDir is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkEnv utility is used to create a base SparkEnv for the driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils utility is used to fetchFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverLogger is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RocksDBStateStoreProvider (Spark Structured Streaming) is requested for a RocksDB
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • PythonBroadcast (PySpark) is requested to readObject
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • AggregateInPandasExec (PySpark) is requested to doExecute
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • EvalPythonExec (PySpark) is requested to doExecute
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • WindowInPandasExec (PySpark) is requested to doExecute
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • PythonForeachWriter (PySpark) is requested for a UnsafeRowBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Client (Spark on YARN) is requested to prepareLocalResources and createConfArchive
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#localrootdirs-registry","title":"localRootDirs Registry

Utils utility uses the localRootDirs internal registry so that getOrCreateLocalRootDirsImpl is executed just once (when first requested).

localRootDirs is available using the getOrCreateLocalRootDirs method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOrCreateLocalRootDirs(\n  conf: SparkConf): Array[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOrCreateLocalRootDirs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils is used to getLocalDir
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Worker (Spark Standalone) is requested to launch an executor
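
A sketch of the once-only initialization the registry implements (LocalRootDirsSketch and its stand-in getOrCreateLocalRootDirsImpl are hypothetical and heavily simplified):

import org.apache.spark.SparkConf\n\nobject LocalRootDirsSketch {\n  @volatile private var localRootDirs: Array[String] = null\n\n  def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = synchronized {\n    if (localRootDirs == null) {\n      localRootDirs = getOrCreateLocalRootDirsImpl(conf)\n    }\n    localRootDirs\n  }\n\n  // stand-in for the real getOrCreateLocalRootDirsImpl (described in the next section)\n  private def getOrCreateLocalRootDirsImpl(conf: SparkConf): Array[String] =\n    conf.get(\"spark.local.dir\", System.getProperty(\"java.io.tmpdir\")).split(\",\")\n}\n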
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#creating-spark-directory-in-every-local-root-directory","title":"Creating spark Directory in Every Local Root Directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOrCreateLocalRootDirsImpl(\n  conf: SparkConf): Array[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOrCreateLocalRootDirsImpl creates a spark-[randomUUID] directory under every root directory for local storage (and registers a shutdown hook to delete the directories at shutdown).

getOrCreateLocalRootDirsImpl prints out the following WARN message to the logs when any of the local root directories is specified as a URI (with a scheme):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The configured local directories are not expected to be URIs;\nhowever, got suspicious values [[uris]].\nPlease check your configured local directories.\n
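
A simplified, hypothetical sketch of what getOrCreateLocalRootDirsImpl does with every root directory (createSparkDirs is a made-up name):

import java.io.File\nimport java.util.UUID\n\n// Simplified sketch: create a spark-<randomUUID> subdirectory under every root directory\ndef createSparkDirs(rootDirs: Array[String]): Array[String] =\n  rootDirs.map { root =>\n    val dir = new File(root, s\"spark-${UUID.randomUUID()}\")\n    dir.mkdirs()\n    sys.addShutdownHook(dir.delete()) // the real code deletes the directories recursively at shutdown\n    dir.getAbsolutePath\n  }\n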
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#local-uri-scheme","title":"Local URI Scheme

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Utils defines a local URI scheme for files that are locally available on worker nodes in the cluster.

The local URI scheme is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils is used to isLocalUri
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Client (Spark on YARN) is used
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#islocaluri","title":"isLocalUri
                                                                                                                                                                                                                                                                                                                                                                                                                                                        isLocalUri(\n  uri: String): Boolean\n

isLocalUri is true when the given URI uses the local: scheme (i.e. the uri starts with local:).
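
The check reduces to a prefix test on the local: scheme; a sketch consistent with the description above:

def isLocalUri(uri: String): Boolean = uri.startsWith(\"local:\")\n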

                                                                                                                                                                                                                                                                                                                                                                                                                                                        isLocalUri is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#getcurrentusername","title":"getCurrentUserName
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getCurrentUserName(): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getCurrentUserName computes the user name who has started the SparkContext.md[SparkContext] instance.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: It is later available as SparkContext.md#sparkUser[SparkContext.sparkUser].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Internally, it reads SparkContext.md#SPARK_USER[SPARK_USER] environment variable and, if not set, reverts to Hadoop Security API's UserGroupInformation.getCurrentUser().getShortUserName().

NOTE: It is another place where Spark relies on the Hadoop API for its operation.
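
The following is a minimal Scala sketch of that fall-back logic (assuming Hadoop's UserGroupInformation is on the classpath); it is an illustration, not the exact implementation:

import org.apache.hadoop.security.UserGroupInformation

// Prefer the SPARK_USER environment variable; otherwise ask the Hadoop Security API
def currentUserName(): String =
  Option(System.getenv("SPARK_USER")).filter(_.nonEmpty)
    .getOrElse(UserGroupInformation.getCurrentUser.getShortUserName)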

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#localhostname","title":"localHostName
                                                                                                                                                                                                                                                                                                                                                                                                                                                        localHostName(): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        localHostName computes the local host name.

It first checks the SPARK_LOCAL_HOSTNAME environment variable. If that is not defined, it uses SPARK_LOCAL_IP to resolve the name (using InetAddress.getByName). If that is not defined either, it falls back to InetAddress.getLocalHost for the name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: Utils.localHostName is executed while SparkContext.md#creating-instance[SparkContext is created] and also to compute the default value of spark-driver.md#spark_driver_host[spark.driver.host Spark property].
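
A simplified Scala sketch of that resolution order (error handling omitted; an approximation of the utility, not its exact code):

import java.net.InetAddress

// Resolution order: SPARK_LOCAL_HOSTNAME, then SPARK_LOCAL_IP, then the JVM's view of localhost
def localHostName(): String =
  Option(System.getenv("SPARK_LOCAL_HOSTNAME"))
    .orElse(Option(System.getenv("SPARK_LOCAL_IP"))
      .map(ip => InetAddress.getByName(ip).getHostName))
    .getOrElse(InetAddress.getLocalHost.getHostName)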

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#getuserjars","title":"getUserJars
                                                                                                                                                                                                                                                                                                                                                                                                                                                        getUserJars(\n  conf: SparkConf): Seq[String]\n

getUserJars returns the non-empty entries of the spark.jars configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        getUserJars is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#extracthostportfromsparkurl","title":"extractHostPortFromSparkUrl
                                                                                                                                                                                                                                                                                                                                                                                                                                                        extractHostPortFromSparkUrl(\n  sparkUrl: String): (String, Int)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        extractHostPortFromSparkUrl creates a Java URI with the input sparkUrl and takes the host and port parts.

extractHostPortFromSparkUrl asserts that the input sparkUrl uses the spark scheme.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        extractHostPortFromSparkUrl throws a SparkException for unparseable spark URLs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Invalid master URL: [sparkUrl]\n
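
The parsing can be sketched as follows (a simplified illustration; the real utility performs stricter validation of the URL parts):

import java.net.URI
import org.apache.spark.SparkException

// Parse a spark://host:port master URL into its host and port parts
def extractHostPort(sparkUrl: String): (String, Int) = {
  val uri = new URI(sparkUrl)
  if (uri.getScheme != "spark" || uri.getHost == null || uri.getPort < 0) {
    throw new SparkException(s"Invalid master URL: $sparkUrl")
  }
  (uri.getHost, uri.getPort)
}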

                                                                                                                                                                                                                                                                                                                                                                                                                                                        extractHostPortFromSparkUrl is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StandaloneSubmitRequestServlet is requested to buildDriverDescription
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RpcAddress is requested to extract an RpcAddress from a Spark master URL
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#isDynamicAllocationEnabled","title":"isDynamicAllocationEnabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                        isDynamicAllocationEnabled(\n  conf: SparkConf): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        isDynamicAllocationEnabled checks whether Dynamic Allocation of Executors is enabled (true) or not (false).

isDynamicAllocationEnabled is positive (true) when all of the following hold (as sketched below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. spark.dynamicAllocation.enabled configuration property is true
                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. spark.master is non-local
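
A minimal Scala sketch of that check, assuming only the two conditions above (the actual utility may consult additional internal or testing-only flags):

import org.apache.spark.SparkConf

// Dynamic allocation requires the flag to be enabled and a non-local master
def isDynamicAllocationEnabled(conf: SparkConf): Boolean =
  conf.getBoolean("spark.dynamicAllocation.enabled", false) &&
    !conf.get("spark.master", "").startsWith("local")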

                                                                                                                                                                                                                                                                                                                                                                                                                                                        isDynamicAllocationEnabled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created (to start an ExecutorAllocationManager)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskResourceProfile is requested for custom executor resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ResourceProfileManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is requested to checkBarrierStageWithDynamicAllocation
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSchedulerImpl is requested to resourceOffers
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SchedulerBackendUtils is requested to getInitialTargetExecutorNumber
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StandaloneSchedulerBackend (Spark Standalone) is requested to start (for reporting purposes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorPodsAllocator (Spark on Kubernetes) is created (maxPVCs)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ApplicationMaster (Spark on YARN) is created (maxNumExecutorFailures)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • YarnSchedulerBackend (Spark on YARN) is requested to getShufflePushMergerLocations
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#checkandgetk8smasterurl","title":"checkAndGetK8sMasterUrl
                                                                                                                                                                                                                                                                                                                                                                                                                                                        checkAndGetK8sMasterUrl(\n  rawMasterURL: String): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        checkAndGetK8sMasterUrl...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                        checkAndGetK8sMasterUrl is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkSubmit is requested to prepareSubmitEnvironment (for Kubernetes cluster manager)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#fetching-file","title":"Fetching File
                                                                                                                                                                                                                                                                                                                                                                                                                                                        fetchFile(\n  url: String,\n  targetDir: File,\n  conf: SparkConf,\n  securityMgr: SecurityManager,\n  hadoopConf: Configuration,\n  timestamp: Long,\n  useCache: Boolean): File\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        fetchFile...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                        fetchFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is requested to SparkContext.md#addFile[addFile]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is requested to executor:Executor.md#updateDependencies[updateDependencies]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark Standalone's DriverRunner is requested to downloadUserJar

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#ispushbasedshuffleenabled","title":"isPushBasedShuffleEnabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                        isPushBasedShuffleEnabled(\n  conf: SparkConf,\n  isDriver: Boolean,\n  checkSerializer: Boolean = true): Boolean\n

isPushBasedShuffleEnabled takes the value of the spark.shuffle.push.enabled configuration property (from the given SparkConf).

If it is disabled (false), isPushBasedShuffleEnabled simply returns false.

Otherwise, isPushBasedShuffleEnabled determines whether push-based shuffle is possible at all, based on the following (see the sketch after the WARN message below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. External Shuffle Service is used (based on spark.shuffle.service.enabled that should be true)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. spark.master is yarn
                                                                                                                                                                                                                                                                                                                                                                                                                                                        3. (only with checkSerializer enabled) spark.serializer is a Serializer that supportsRelocationOfSerializedObjects
                                                                                                                                                                                                                                                                                                                                                                                                                                                        4. spark.io.encryption.enabled is false

If the spark.shuffle.push.enabled configuration property is enabled but the above requirements do not hold, isPushBasedShuffleEnabled prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Push-based shuffle can only be enabled\nwhen the application is submitted to run in YARN mode,\nwith external shuffle service enabled, IO encryption disabled,\nand relocation of serialized objects supported.\n
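
A simplified Scala sketch of the decision (the serializer-relocation check and the WARN logging are left out; property names are the ones listed above):

import org.apache.spark.SparkConf

// The flag itself, external shuffle service, a YARN master, and no IO encryption
def isPushBasedShuffleEnabled(conf: SparkConf): Boolean =
  conf.getBoolean("spark.shuffle.push.enabled", false) &&
    conf.getBoolean("spark.shuffle.service.enabled", false) &&
    conf.get("spark.master", "") == "yarn" &&
    !conf.getBoolean("spark.io.encryption.enabled", false)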

isPushBasedShuffleEnabled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleDependency is requested to canShuffleMergeBeEnabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTrackerMaster is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTrackerWorker is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleBlockPusher utility is used to create a BLOCK_PUSHER_POOL thread pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to initialize and registerWithExternalShuffleServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManagerMasterEndpoint is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskBlockManager is requested to createLocalDirsForMergedShuffleBlocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"Utils/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.util.Utils logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.util.Utils=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"architecture/","title":"Architecture","text":"


Spark uses a master/worker architecture. A spark-driver.md[driver] talks to a single coordinator, the spark-master.md[master], which manages spark-workers.md[workers] on which executor:Executor.md[executors] run.

Figure: Spark architecture (image: driver-sparkcontext-clustermanager-workers-executors.png)

The driver and the executors run in their own Java processes. You can run them all on the same machine (horizontal cluster), on separate machines (vertical cluster), or in a mixed machine configuration.

Figure: Spark architecture in detail (image: sparkapp-sparkcontext-master-slaves.png)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Physical machines are called hosts or nodes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"configuration-properties/","title":"Configuration Properties","text":""},{"location":"configuration-properties/#sparkappid","title":"spark.app.id

Unique identifier of a Spark application that Spark uses to identify metric sources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: TaskScheduler.applicationId()

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Set when SparkContext is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkbroadcastblocksize","title":"spark.broadcast.blockSize

Size of each piece of a broadcast block (in kB unless the unit is specified)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 4m

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, BlockManager might take a performance hit

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TorrentBroadcast is requested to setConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkbroadcastcompress","title":"spark.broadcast.compress

Controls whether to compress broadcast variables before sending them over the wire

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Generally a good idea. Compression will use spark.io.compression.codec
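
For illustration, both broadcast properties can be set on a SparkConf before the SparkContext is created (the values shown are just the documented defaults):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.broadcast.blockSize", "4m")   // the documented default
  .set("spark.broadcast.compress", "true")  // the documented default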

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TorrentBroadcast is requested to setConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SerializerManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.buffer.pageSize","title":"spark.buffer.pageSize


                                                                                                                                                                                                                                                                                                                                                                                                                                                        The amount of memory used per page (in bytes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkcleanerreferencetracking","title":"spark.cleaner.referenceTracking

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to enable ContextCleaner

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdiskstoresubdirectories","title":"spark.diskStore.subDirectories

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Number of subdirectories inside each path listed in spark.local.dir for hashing block files into.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 64

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used by BlockManager and DiskBlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdriverhost","title":"spark.driver.host

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Address of the driver (endpoints)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: Utils.localCanonicalHostName

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdriverlogallowerasurecoding","title":"spark.driver.log.allowErasureCoding

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DfsAsyncWriter is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdriverlogdfsdir","title":"spark.driver.log.dfsDir

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The directory on a Hadoop DFS-compliant file system where DriverLogger copies driver logs to

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • FsHistoryProvider is requested to startPolling (and cleanDriverLogs)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DfsAsyncWriter is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverLogger utility is used to create a DriverLogger (for a SparkContext)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdriverlogpersisttodfsenabled","title":"spark.driver.log.persistToDfs.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enables DriverLogger

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverLogger utility is used to create a DriverLogger (for a SparkContext)
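
The two driver-log properties are typically enabled together. A minimal sketch, assuming a hypothetical HDFS directory (hdfs:///user/spark/driverLogs is an example path, not a Spark default):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("driver-logs-demo")                                   // example application name
  .set("spark.driver.log.persistToDfs.enabled", "true")             // turn on DriverLogger
  .set("spark.driver.log.dfsDir", "hdfs:///user/spark/driverLogs")  // example target directory
val sc = SparkContext.getOrCreate(conf)
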
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdrivermaxresultsize","title":"spark.driver.maxResultSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum size of task results (in bytes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1g

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskRunner is requested to run a task (and decide on the type of a serialized task result)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSetManager is requested to check available memory for task results
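
As a rough illustration only (the 2g value and the tiny job below are made up, not from the source), the cap can be raised for actions that bring large results back to the driver; results exceeding the limit cause the job to fail:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-result-size-demo")
  .set("spark.driver.maxResultSize", "2g")  // example value; default is 1g
val sc = SparkContext.getOrCreate(conf)

// collect() sends serialized task results to the driver and is subject to the limit.
val values = sc.parallelize(1 to 1000).collect()
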

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkdriverport","title":"spark.driver.port

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Port of the driver (endpoints)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutorcores","title":"spark.executor.cores

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Number of CPU cores for Executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutorheartbeatmaxfailures","title":"spark.executor.heartbeat.maxFailures

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Number of times an Executor tries sending heartbeats to the driver before it gives up and exits (with exit code 56).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 60

For example, with the default of 60 max failures and spark.executor.heartbeatInterval of 10s, an Executor keeps trying to send heartbeats for up to 600s (10 minutes) before exiting (see the sketch below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is created (and reportHeartBeat)
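
A small sketch of how this property combines with spark.executor.heartbeatInterval (described next); the values simply restate the defaults for illustration:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s")
  .set("spark.executor.heartbeat.maxFailures", "60")

// The executor keeps retrying for roughly interval * maxFailures before exiting with code 56.
val intervalSec = conf.getTimeAsSeconds("spark.executor.heartbeatInterval")
val maxFailures = conf.getInt("spark.executor.heartbeat.maxFailures", 60)
println(s"Heartbeats are retried for up to ${intervalSec * maxFailures}s")  // 600s
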
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutorheartbeatinterval","title":"spark.executor.heartbeatInterval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Interval between Executor heartbeats (to the driver)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 10s

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is created and requested to reportHeartBeat
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • HeartbeatReceiver is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutorid","title":"spark.executor.id

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutorinstances","title":"spark.executor.instances

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Number of executors to use

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutormemory","title":"spark.executor.memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Amount of memory to use for an Executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1g

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Equivalent to SPARK_EXECUTOR_MEMORY environment variable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutormemoryoverhead","title":"spark.executor.memoryOverhead

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The amount of non-heap memory (in MiB) to be allocated per executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ResourceProfile is requested for the default executor resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Client (Spark on YARN) is created
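
A sizing sketch with arbitrary example figures (not from the source); on a resource manager such as YARN, the requested container is roughly the executor heap plus this overhead:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")           // example heap size per executor
  .set("spark.executor.memoryOverhead", "512")  // example off-heap overhead in MiB
// Approximate per-executor container request: 4g heap + 512m overhead.
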
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutormetricsfilesystemschemes","title":"spark.executor.metrics.fileSystemSchemes

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A comma-separated list of the file system schemes to report in executor metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: file,hdfs

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutormetricspollinginterval","title":"spark.executor.metrics.pollingInterval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        How often to collect executor metrics (in ms):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • 0 - the polling is done on executor heartbeats
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • A positive number - the polling is done at this interval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkexecutoruserclasspathfirst","title":"spark.executor.userClassPathFirst

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to load classes in user-defined jars before those in Spark jars

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • CoarseGrainedExecutorBackend is requested to create a ClassLoader
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Client utility (Spark on YARN) is used to isUserClassPathFirst
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkextralisteners","title":"spark.extraListeners

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A comma-separated list of fully-qualified class names of SparkListeners (to be registered when SparkContext is created)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (empty)
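
A minimal sketch of a custom listener wired in through this property; the package com.example, the class AppEndLogger, and the object name are all made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Hypothetical listener: needs a zero-argument constructor so Spark can instantiate it by name.
class AppEndLogger extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    println(s"Application ended at ${end.time}")
}

object ExtraListenersDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("extra-listeners-demo")
      .set("spark.extraListeners", "com.example.AppEndLogger")  // fully-qualified class name
    val sc = SparkContext.getOrCreate(conf)
    sc.stop()
  }
}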

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkfiletransferto","title":"spark.file.transferTo

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to use Java FileChannels (Java NIO) for copying data between two Java FileInputStreams to improve copy performance

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter and UnsafeShuffleWriter are created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkfiles","title":"spark.files

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The files to be added to a Spark application (that can be defined directly as a configuration property or indirectly using --files option of spark-submit script)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (empty)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created
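
A sketch of the two ways a file can be shipped with an application; the path /tmp/lookup.csv is an example only:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Either declare the file up front (same effect as --files on spark-submit) ...
val conf = new SparkConf()
  .setAppName("files-demo")
  .set("spark.files", "/tmp/lookup.csv")
val sc = SparkContext.getOrCreate(conf)

// ... or add it programmatically once the context is up.
sc.addFile("/tmp/lookup.csv")

// Tasks (and the driver) resolve their local copy by file name.
val localPath = SparkFiles.get("lookup.csv")
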
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkioencryptionenabled","title":"spark.io.encryption.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls local disk I/O encryption

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

• SparkEnv utility is used to create a SparkEnv for the driver (to create an I/O encryption key)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records (and fetchContinuousBlocksInBatch)
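
A minimal sketch of enabling encryption of the data Spark writes to local disk (e.g. shuffle files); the application name is an example:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("io-encryption-demo")
  .set("spark.io.encryption.enabled", "true")  // SparkEnv then creates an I/O encryption key for the app
val sc = SparkContext.getOrCreate(conf)
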
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkjars","title":"spark.jars

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (empty)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkkryopool","title":"spark.kryo.pool

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializer is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkkryounsafe","title":"spark.kryo.unsafe

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Whether KryoSerializer should use Unsafe-based IO for serialization

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false
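
A sketch that switches on KryoSerializer together with the two Kryo flags described above; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.pool", "true")    // reuse a pool of Kryo instances
  .set("spark.kryo.unsafe", "true")  // Unsafe-based IO for (de)serialization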

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocaldir","title":"spark.local.dir

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A comma-separated list of directory paths for \"scratch\" space (a temporary storage for map output files, RDDs that get stored on disk, etc.). It is recommended to use paths on fast local disks in your system (e.g. SSDs).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: java.io.tmpdir System property
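
A sketch of spreading scratch space over several fast local disks; the paths are examples only, and cluster managers (e.g. YARN) usually override this setting with their own local directories:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")  // example SSD-backed paths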

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocalitywait","title":"spark.locality.wait

                                                                                                                                                                                                                                                                                                                                                                                                                                                        How long to wait until an executor is available for locality-aware delay scheduling (for PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL TaskLocalities) unless locality-specific setting is set (i.e., spark.locality.wait.process, spark.locality.wait.node, and spark.locality.wait.rack, respectively)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 3s
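
A tuning sketch with arbitrary durations: a short global wait combined with a longer override for NODE_LOCAL only (the per-level properties are described in the following entries):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "1s")       // fallback for all locality levels
  .set("spark.locality.wait.node", "3s")  // override for NODE_LOCAL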

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocalitywaitlegacyresetontasklaunch","title":"spark.locality.wait.legacyResetOnTaskLaunch

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (internal) Whether to use the legacy behavior of locality wait, which resets the delay timer anytime a task is scheduled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSchedulerImpl is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSetManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocalitywaitnode","title":"spark.locality.wait.node

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Scheduling delay for TaskLocality.NODE_LOCAL

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.locality.wait

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSetManager is requested for the locality wait (of TaskLocality.NODE_LOCAL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocalitywaitprocess","title":"spark.locality.wait.process

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Scheduling delay for TaskLocality.PROCESS_LOCAL

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.locality.wait

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSetManager is requested for the locality wait (of TaskLocality.PROCESS_LOCAL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklocalitywaitrack","title":"spark.locality.wait.rack

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Scheduling delay for TaskLocality.RACK_LOCAL

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.locality.wait

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSetManager is requested for the locality wait (of TaskLocality.RACK_LOCAL)
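
For illustration only, the locality wait properties above can be set on a plain SparkConf before the SparkContext is created; the property names are the ones documented here, while the wait values are arbitrary examples.

import org.apache.spark.{SparkConf, SparkContext}

// Tune the global delay-scheduling wait and override it per locality level.
// The values below are example values, not recommendations.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .setMaster("local[*]")
  .set("spark.locality.wait", "3s")          // global default
  .set("spark.locality.wait.process", "1s")  // PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL
  .set("spark.locality.wait.rack", "5s")     // RACK_LOCAL

val sc = SparkContext.getOrCreate(conf)
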
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparklogconf","title":"spark.logConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkloglineage","title":"spark.logLineage

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enables printing out the RDD lineage graph (using RDD.toDebugString) when executing an action (and running a job)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false
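
As a rough illustration (with made-up data), the lineage graph that spark.logLineage logs on an action is the same one RDD.toDebugString returns:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("lineage-demo")
  .setMaster("local[*]")
  .set("spark.logLineage", "true")  // print the RDD lineage on every action
val sc = SparkContext.getOrCreate(conf)

val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ % 4 == 0)
println(rdd.toDebugString)  // the lineage graph, printed explicitly
rdd.count()                 // with spark.logLineage=true, the lineage is also printed here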

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkmaster","title":"spark.master

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Master URL of the cluster manager to connect the Spark application to
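
A minimal sketch of wiring spark.master programmatically; local[*] and the spark:// URL in the comment are example master URLs only.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("master-url-demo")
  .setMaster("local[*]")  // e.g. "spark://host:7077", "yarn"
val sc = SparkContext.getOrCreate(conf)

// Equivalent on the command line, e.g.:
//   spark-submit --master spark://host:7077 --class MyApp my-app.jar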

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkmemoryfraction","title":"spark.memory.fraction

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Fraction of JVM heap space used for execution and storage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0.6

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The lower the more frequent spills and cached data eviction. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkmemoryoffheapenabled","title":"spark.memory.offHeap.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether Tungsten memory will be allocated on the JVM heap (false) or off-heap (true / using sun.misc.Unsafe).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        When enabled, spark.memory.offHeap.size must be greater than 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryManager is requested for tungstenMemoryMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkmemoryoffheapsize","title":"spark.memory.offHeap.size

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum memory (in bytes) for off-heap memory allocation

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                        This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit then be sure to shrink your JVM heap size accordingly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must not be negative and be set to a positive value when spark.memory.offHeap.enabled is enabled
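
A minimal sketch, assuming you want Tungsten off-heap allocation: both properties above must be set together, and the 2g size is an arbitrary example.

import org.apache.spark.SparkConf

// spark.memory.offHeap.size must be > 0 when spark.memory.offHeap.enabled is true.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")  // example value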

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkmemorystoragefraction","title":"spark.memory.storageFraction

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Amount of storage memory immune to eviction, expressed as a fraction of the size of the region set aside by spark.memory.fraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0.5

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The higher the less working memory may be available to execution and tasks may spill to disk more often. The default value is recommended.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be in [0,1)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • UnifiedMemoryManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryManager is created
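
To make the two memory fractions concrete, here is an illustrative back-of-the-envelope calculation of the unified memory regions, assuming the usual 300 MB of reserved memory and a hypothetical 4 GB heap:

// Illustrative arithmetic only (assumes the usual 300 MB of reserved memory).
val heapBytes       = 4L * 1024 * 1024 * 1024   // example: 4 GB executor heap
val reservedBytes   = 300L * 1024 * 1024        // reserved for Spark internals
val memoryFraction  = 0.6                       // spark.memory.fraction
val storageFraction = 0.5                       // spark.memory.storageFraction

val unifiedRegion   = ((heapBytes - reservedBytes) * memoryFraction).toLong
val storageRegion   = (unifiedRegion * storageFraction).toLong  // storage memory immune to eviction
val executionRegion = unifiedRegion - storageRegion             // execution can also borrow free storage memory

println(s"unified=$unifiedRegion storage=$storageRegion execution=$executionRegion")
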
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparknetworkiopreferdirectbufs","title":"spark.network.io.preferDirectBufs

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparknetworkmaxremoteblocksizefetchtomem","title":"spark.network.maxRemoteBlockSizeFetchToMem

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Remote block will be fetched to disk when size of the block is above this threshold in bytes

                                                                                                                                                                                                                                                                                                                                                                                                                                                        This is to avoid a giant request takes too much memory. Note this configuration will affect both shuffle fetch and block manager remote block fetch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        With an external shuffle service use at least 2.3.0

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 200m

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransferService is requested to uploadBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to fetchRemoteManagedBuffer
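
For illustration, lowering the threshold forces more remote blocks to be streamed to disk instead of into memory; 100m below is an arbitrary example value.

import org.apache.spark.SparkConf

// Remote blocks bigger than this threshold are fetched to disk rather than into memory.
val conf = new SparkConf()
  .set("spark.network.maxRemoteBlockSizeFetchToMem", "100m")  // example; default is 200m
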
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparknetworksharedbytebufallocatorsenabled","title":"spark.network.sharedByteBufAllocators.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparknetworktimeout","title":"spark.network.timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Network timeout (in seconds) to use for RPC remote endpoint lookup

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 120s

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparknetworktimeoutinterval","title":"spark.network.timeoutInterval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (in millis)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.storage.blockManagerTimeoutIntervalMs

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkrddcompress","title":"spark.rdd.compress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to compress RDD partitions when stored serialized

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false
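
Since spark.rdd.compress only applies to partitions stored serialized, a sketch (with example data) pairs it with a serialized storage level:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("rdd-compress-demo")
  .setMaster("local[*]")
  .set("spark.rdd.compress", "true")  // compress serialized RDD partitions
val sc = SparkContext.getOrCreate(conf)

// Compression applies because MEMORY_ONLY_SER stores partitions serialized.
val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER)
cached.count()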

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkreducermaxblocksinflightperaddress","title":"spark.reducer.maxBlocksInFlightPerAddress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum number of remote blocks being fetched per reduce task from a given host port

                                                                                                                                                                                                                                                                                                                                                                                                                                                        When a large number of blocks are being requested from a given address in a single fetch or simultaneously, this could crash the serving executor or a Node Manager. This is especially useful to reduce the load on the Node Manager when external shuffle is enabled. You can mitigate the issue by setting it to a lower value.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (unlimited)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkreducermaxreqsinflight","title":"spark.reducer.maxReqsInFlight

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum number of remote requests to fetch blocks at any given point

                                                                                                                                                                                                                                                                                                                                                                                                                                                        When the number of hosts in the cluster increase, it might lead to very large number of inbound connections to one or more nodes, causing the workers to fail under load. By allowing it to limit the number of fetch requests, this scenario can be mitigated

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (unlimited)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkreducermaxsizeinflight","title":"spark.reducer.maxSizeInFlight

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum size of all map outputs to fetch simultaneously from each reduce task (in MiB unless otherwise specified)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 48m

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
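
A combined tuning sketch for the three spark.reducer.* limits documented above; all values are arbitrary examples rather than recommendations.

import org.apache.spark.SparkConf

// Cap the shuffle-fetch pressure a single reduce task can put on the cluster.
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")             // total bytes in flight (default 48m)
  .set("spark.reducer.maxReqsInFlight", "64")              // concurrent fetch requests (example)
  .set("spark.reducer.maxBlocksInFlightPerAddress", "128") // blocks per host:port (example)
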
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkreplclassuri","title":"spark.repl.class.uri

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to compress RDD partitions when stored serialized

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkrpclookuptimeout","title":"spark.rpc.lookupTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default Endpoint Lookup Timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 120s

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkrpcmessagemaxsize","title":"spark.rpc.message.maxSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum allowed message size for RPC communication (in MB unless specified)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 128

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be below 2047MB (Int.MaxValue / 1024 / 1024)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • CoarseGrainedSchedulerBackend is requested to launch tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RpcUtils is requested for the maximum message size
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MapOutputTrackerMaster is created (and makes sure that spark.shuffle.mapOutput.minSizeForBroadcast is below the threshold)
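
A small sketch of setting spark.rpc.message.maxSize; the require check simply mirrors the 2047 MB constraint stated above and is illustrative only.

import org.apache.spark.SparkConf

val maxSizeMb = 256  // example value, in MB
require(maxSizeMb < 2047, s"spark.rpc.message.maxSize must be below 2047MB, got $maxSizeMb")

val conf = new SparkConf()
  .set("spark.rpc.message.maxSize", maxSizeMb.toString)
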
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkscheduler","title":"spark.scheduler","text":""},{"location":"configuration-properties/#spark.scheduler.barrier.maxConcurrentTasksCheck.interval","title":"barrier.maxConcurrentTasksCheck.interval","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.scheduler.barrier.maxConcurrentTasksCheck.interval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"configuration-properties/#spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures","title":"barrier.maxConcurrentTasksCheck.maxFailures","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"configuration-properties/#spark.scheduler.minRegisteredResourcesRatio","title":"minRegisteredResourcesRatio","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.scheduler.minRegisteredResourcesRatio

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Minimum ratio of (registered resources / total expected resources) before submitting tasks

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"configuration-properties/#spark.scheduler.revive.interval","title":"spark.scheduler.revive.interval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.scheduler.revive.interval

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The time (in millis) between resource offers revives

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1s

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverEndpoint is requested to onStart
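
As a minimal sketch only, the scheduler-related properties above can be set programmatically before the SparkContext is created; the values below are made-up examples, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only, not tuning advice.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("scheduler-config-demo")
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8") // wait until 80% of expected resources register
  .set("spark.scheduler.revive.interval", "2s")              // revive resource offers every 2 seconds
val sc = SparkContext.getOrCreate(conf)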
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkserializer","title":"spark.serializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The fully-qualified class name of the Serializer (of the driver and executors)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: org.apache.spark.serializer.JavaSerializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkEnv utility is used to create a SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkConf is requested to registerKryoClasses (as a side-effect)
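
As a hedged sketch (the Point class is a made-up example), the serializer can be switched to Kryo either by setting spark.serializer directly or by calling registerKryoClasses, which sets it as the side effect mentioned above.

import org.apache.spark.{SparkConf, SparkContext}

case class Point(x: Double, y: Double) // example class to register with Kryo

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("serializer-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Point])) // also sets spark.serializer to KryoSerializer as a side effect
val sc = SparkContext.getOrCreate(conf)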
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkshuffle","title":"spark.shuffle","text":""},{"location":"configuration-properties/#spark.shuffle.sort.io.plugin.class","title":"sort.io.plugin.class

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.sort.io.plugin.class

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Name of the class to use for shuffle IO

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: LocalDiskShuffleDataIO

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleDataIOUtils is requested to loadShuffleDataIO
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.checksum.enabled","title":"checksum.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.checksum.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls checksuming of shuffle data. If enabled, Spark will calculate the checksum values for each partition data within the map output file and store the values in a checksum file on the disk. When there's shuffle data corruption detected, Spark will try to diagnose the cause (e.g., network issue, disk issue, etc.) of the corruption by using the checksum file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.compress","title":"compress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.compress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enables compressing shuffle output when stored

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.detectCorrupt","title":"detectCorrupt

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.detectCorrupt

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls corruption detection in fetched blocks

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.detectCorrupt.useExtraMemory","title":"detectCorrupt.useExtraMemory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.detectCorrupt.useExtraMemory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        If enabled, part of a compressed/encrypted stream will be de-compressed/de-crypted by using extra memory to detect early corruption. Any IOException thrown will cause the task to be retried once and if it fails again with same exception, then FetchFailedException will be thrown to retry previous stage

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader is requested to read combined records for a reduce task
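
A hedged sketch of how the two corruption-detection flags are typically paired: the second flag trades extra memory on the fetch path for catching corruption before a task has done significant work. Values are illustrative.

import org.apache.spark.SparkConf

// Illustrative pairing of the two corruption-detection flags.
val conf = new SparkConf()
  .set("spark.shuffle.detectCorrupt", "true")                // keep the default block-level detection
  .set("spark.shuffle.detectCorrupt.useExtraMemory", "true") // opt in to early detection at the cost of extra memory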
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.file.buffer","title":"file.buffer

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.file.buffer

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Size of the in-memory buffer for each shuffle file output stream, in KiB unless otherwise specified. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 32k

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be greater than 0 and less than or equal to 2097151 ((Integer.MAX_VALUE - 15) / 1024)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when the following are created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • UnsafeShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalAppendOnlyMap
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalSorter
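
A small sketch of how the buffer size is specified as a byte-size string; 64k is an arbitrary example, not a tuning recommendation, and a unitless value would be interpreted as KiB.

import org.apache.spark.SparkConf

// 64k is an arbitrary example; "64" alone would also mean 64 KiB.
val conf = new SparkConf()
  .set("spark.shuffle.file.buffer", "64k")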
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.manager","title":"manager

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.manager

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A fully-qualified class name or the alias of the ShuffleManager in a Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: sort

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Supported aliases:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • sort
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • tungsten-sort

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when SparkEnv object is requested to create a \"base\" SparkEnv for a driver or an executor
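
A sketch of the two ways the property can be set, by alias or by fully-qualified class name; both resolve to the sort-based shuffle manager.

import org.apache.spark.SparkConf

// By alias (sort and tungsten-sort resolve to the same sort-based implementation)
val byAlias = new SparkConf().set("spark.shuffle.manager", "tungsten-sort")

// By fully-qualified class name
val byClass = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.sort.SortShuffleManager")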

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.mapOutput.parallelAggregationThreshold","title":"mapOutput.parallelAggregationThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.mapOutput.parallelAggregationThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (internal) Multi-thread is used when the number of mappers * shuffle partitions is greater than or equal to this threshold. Note that the actual parallelism is calculated by number of mappers * shuffle partitions / this threshold + 1, so this threshold should be positive.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 10000000

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTrackerMaster is requested for the statistics of a ShuffleDependency
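
A worked example of the parallelism formula above, under assumed figures (20,000 mappers and 1,000 shuffle partitions) that are not from the source.

// Assumed figures for illustration only.
val mappers           = 20000
val shufflePartitions = 1000
val threshold         = 10000000L // default spark.shuffle.mapOutput.parallelAggregationThreshold

val useParallelAggregation = mappers.toLong * shufflePartitions >= threshold // 20,000,000 >= 10,000,000 => true
val parallelism = (mappers.toLong * shufflePartitions / threshold + 1).toInt // 20,000,000 / 10,000,000 + 1 = 3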
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.minNumPartitionsToHighlyCompress","title":"minNumPartitionsToHighlyCompress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.minNumPartitionsToHighlyCompress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (internal) Minimum number of partitions (threshold) for MapStatus utility to prefer a HighlyCompressedMapStatus (over CompressedMapStatus) (for ShuffleWriters).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 2000

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be a positive integer (above 0)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.push.enabled","title":"push.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.push.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enables push-based shuffle on the client side

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Works in conjunction with the server side flag spark.shuffle.push.server.mergedShuffleFileManagerImpl which needs to be set with the appropriate org.apache.spark.network.shuffle.MergedShuffleFileManager implementation for push-based shuffle to be enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils utility is used to determine whether push-based shuffle is enabled or not
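
A hedged sketch of enabling push-based shuffle on the client side alongside the server-side flag mentioned above; RemoteBlockPushResolver is assumed to be the MergedShuffleFileManager implementation shipped with the external shuffle service, so verify the class name for your Spark version.

import org.apache.spark.SparkConf

// Client side (the Spark application); push-based shuffle also relies on the external shuffle service.
val conf = new SparkConf()
  .set("spark.shuffle.push.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")

// Server side: the external shuffle service (not the application) is configured with, e.g.:
// spark.shuffle.push.server.mergedShuffleFileManagerImpl =
//   org.apache.spark.network.shuffle.RemoteBlockPushResolver  // assumed implementation class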
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.readHostLocalDisk","title":"readHostLocalDisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.readHostLocalDisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                        If enabled (with spark.shuffle.useOldFetchProtocol disabled and spark.shuffle.service.enabled enabled), shuffle blocks requested from those block managers which are running on the same host are read from the disk directly instead of being fetched as remote blocks over the network.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: true
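
A sketch of the flag combination the description above requires for host-local disk reads; illustrative only.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.readHostLocalDisk", "true")    // default
  .set("spark.shuffle.useOldFetchProtocol", "false") // must remain disabled
  .set("spark.shuffle.service.enabled", "true")      // external shuffle service must be enabled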

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.registration.maxAttempts","title":"registration.maxAttempts

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.registration.maxAttempts

                                                                                                                                                                                                                                                                                                                                                                                                                                                        How many attempts to register a BlockManager with External Shuffle Service

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 3

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when BlockManager is requested to register with External Shuffle Server

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.sort.bypassMergeThreshold","title":"sort.bypassMergeThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.sort.bypassMergeThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum number of reduce partitions below which SortShuffleManager avoids merge-sorting data for no map-side aggregation

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 200

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortShuffleWriter utility is used to shouldBypassMergeSort
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExchangeExec (Spark SQL) physical operator is requested to prepareShuffleDependency
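
An illustrative sketch of a shuffle that can take the bypass-merge path: no map-side aggregation and fewer reduce partitions than the threshold. The RDD code is a made-up example.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("bypass-merge-demo")
  .set("spark.shuffle.sort.bypassMergeThreshold", "200") // the default, shown explicitly
val sc = SparkContext.getOrCreate(conf)

val pairs = sc.parallelize(1 to 1000).map(n => (n % 50, n))
// groupByKey performs no map-side aggregation and uses 50 (< 200) reduce partitions,
// so shouldBypassMergeSort can choose the bypass path for this shuffle.
pairs.groupByKey(50).count()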
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.spill.initialMemoryThreshold","title":"spill.initialMemoryThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.spill.initialMemoryThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Initial threshold for the size of an in-memory collection

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 5MB

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used by Spillable

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.spill.numElementsForceSpillThreshold","title":"spill.numElementsForceSpillThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.spill.numElementsForceSpillThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (internal) The maximum number of elements in memory before forcing the shuffle sorter to spill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: Integer.MAX_VALUE

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The default value is to never force the sorter to spill, until Spark reaches some limitations, like the max page size limitation for the pointer array in the sorter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spillable is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark SQL's SortBasedAggregator is requested for an UnsafeKVExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark SQL's ObjectAggregationMap is requested to dumpToExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark SQL's UnsafeExternalRowSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark SQL's UnsafeFixedWidthAggregationMap is requested for an UnsafeKVExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.sync","title":"sync

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.sync

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether DiskBlockObjectWriter should force outstanding writes to disk while committing a single atomic block (i.e. all operating system buffers should synchronize with the disk to ensure that all changes to a file are in fact recorded in the storage)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when BlockManager is requested for a DiskBlockObjectWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#spark.shuffle.useOldFetchProtocol","title":"useOldFetchProtocol

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.shuffle.useOldFetchProtocol

Whether to use the old protocol for shuffle block fetching. Enable it only for compatibility, i.e. when a job running on a newer Spark version has to fetch shuffle blocks from an external shuffle service of an older version.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkspeculation","title":"spark.speculation

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls Speculative Execution of Tasks

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkspeculationinterval","title":"spark.speculation.interval

How often to check for tasks to speculate in Speculative Execution of Tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 100ms

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkspeculationmultiplier","title":"spark.speculation.multiplier

How many times slower a task must be than the median task duration to be considered for speculation in Speculative Execution of Tasks.

Default: 1.5

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkspeculationquantile","title":"spark.speculation.quantile

The fraction of tasks (of a stage) that must have completed before speculation is enabled in Speculative Execution of Tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0.75
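To make the interplay of spark.speculation.quantile and spark.speculation.multiplier concrete, here is a hedged sketch of the arithmetic (task count and median duration are made up; the actual check lives in TaskSetManager):

```scala
// Illustrative arithmetic behind speculation thresholds (not the literal Spark code).
val numTasks   = 100
val quantile   = 0.75   // spark.speculation.quantile
val multiplier = 1.5    // spark.speculation.multiplier

// Speculation is considered only once this many tasks of the stage have finished
val minFinishedForSpeculation = math.floor(quantile * numTasks).toInt        // 75

// A running task becomes a speculation candidate once it is this much slower
// than the median duration of the successfully finished tasks (assumed value)
val medianDurationMs       = 4000L
val speculationThresholdMs = (multiplier * medianDurationMs).toLong          // 6000

println(s"Check after $minFinishedForSpeculation finished tasks; " +
  s"candidates run longer than ${speculationThresholdMs}ms")
```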

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkstorageblockmanagerslavetimeoutms","title":"spark.storage.blockManagerSlaveTimeoutMs

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (in millis)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.network.timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkstorageblockmanagertimeoutintervalms","title":"spark.storage.blockManagerTimeoutIntervalMs

                                                                                                                                                                                                                                                                                                                                                                                                                                                        (in millis)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 60s

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkstoragelocaldiskbyexecutorscachesize","title":"spark.storage.localDiskByExecutors.cacheSize

The max number of executors for which the local dirs are stored. This limit applies to both the driver and the executors to avoid an unbounded store. The cache is used to avoid the network when fetching disk-persisted RDD blocks or shuffle blocks (when spark.shuffle.readHostLocalDisk is set) from the same host.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1000

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkstoragereplicationpolicy","title":"spark.storage.replication.policy

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: RandomBlockReplicationPolicy

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkstorageunrollmemorythreshold","title":"spark.storage.unrollMemoryThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Initial memory threshold (in bytes) to unroll (materialize) a block to store in memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1024 * 1024

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be at most the total amount of memory available for storage

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryStore is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparksubmitdeploymode","title":"spark.submit.deployMode
The deploy mode of a Spark application (where the driver runs):
• client (default)
• cluster
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparktaskcpus","title":"spark.task.cpus

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The number of CPU cores to schedule (allocate) to a task

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskSchedulerImpl is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • AppStatusListener is requested to handle a SparkListenerEnvironmentUpdate event
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext utility is used to create a TaskScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ResourceProfile is requested to getDefaultTaskResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • LocalityPreferredContainerPlacementStrategy is requested to numExecutorsPending
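In practice the property bounds task concurrency on an executor: an executor with C cores runs at most C / spark.task.cpus tasks at a time. A hedged sketch of that arithmetic (the executor core count is an assumption, not a Spark default):

```scala
// Illustrative only: how spark.task.cpus bounds concurrent tasks per executor.
val executorCores = 8   // e.g. spark.executor.cores (assumed)
val taskCpus      = 2   // spark.task.cpus

val concurrentTasksPerExecutor = executorCores / taskCpus
println(s"At most $concurrentTasksPerExecutor tasks run concurrently per executor")  // 4
```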
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparktaskmaxdirectresultsize","title":"spark.task.maxDirectResultSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maximum size of a task result (in bytes) to be sent to the driver as a DirectTaskResult

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1048576B (1L << 20)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskRunner is requested to run a task (and decide on the type of a serialized task result)
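The following is a hedged sketch of the size-based routing this property takes part in, not the literal TaskRunner code; the helper name is made up and the spark.driver.maxResultSize default of 1g is an assumption spelled out in bytes:

```scala
// Hypothetical helper illustrating how a task result could be routed by size.
def chooseResultKind(
    resultSizeBytes: Long,
    maxDirectResultSize: Long = 1L << 20,   // spark.task.maxDirectResultSize
    maxResultSize: Long = 1L << 30          // spark.driver.maxResultSize (assumed 1g)
): String = {
  if (maxResultSize > 0 && resultSizeBytes > maxResultSize)
    "dropped (result larger than spark.driver.maxResultSize)"
  else if (resultSizeBytes > maxDirectResultSize)
    "IndirectTaskResult (stored in the BlockManager, reference sent to the driver)"
  else
    "DirectTaskResult (value sent inline to the driver)"
}

println(chooseResultKind(512 * 1024))        // DirectTaskResult
println(chooseResultKind(8L * 1024 * 1024))  // IndirectTaskResult
```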
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparktaskmaxfailures","title":"spark.task.maxFailures

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Number of failures of a single task (of a TaskSet) before giving up on the entire TaskSet and then the job

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 4

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkplugins","title":"spark.plugins

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A comma-separated list of class names implementing org.apache.spark.api.plugin.SparkPlugin to load into a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: (empty)

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Since: 3.0.0

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Set when SparkContext is created
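A minimal sketch of a plugin class that could be listed in spark.plugins; the package and class name are made up, and the plugin deliberately contributes no driver- or executor-side components:

```scala
package com.example

import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// Hypothetical no-op plugin; register it with
//   --conf spark.plugins=com.example.NoOpSparkPlugin
class NoOpSparkPlugin extends SparkPlugin {
  // null means no driver-side component
  override def driverPlugin(): DriverPlugin = null
  // null means no executor-side component
  override def executorPlugin(): ExecutorPlugin = null
}
```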

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkpluginsdefaultlist","title":"spark.plugins.defaultList

                                                                                                                                                                                                                                                                                                                                                                                                                                                        FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"configuration-properties/#sparkuishowconsoleprogress","title":"spark.ui.showConsoleProgress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Controls whether to enable ConsoleProgressBar and show the progress bar in the console

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"developer-api/","title":"Developer API","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        [TAGS]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"driver/","title":"Driver","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A Spark driver (aka an application's driver process) is a JVM process that hosts SparkContext.md[SparkContext] for a Spark application. It is the master node in a Spark application.

It is the cockpit of job and task execution (using scheduler:DAGScheduler.md[DAGScheduler] and scheduler:TaskScheduler.md[Task Scheduler]). It hosts spark-webui.md[Web UI] for the environment.

.Driver with the services
image::spark-driver.png[align=\"center\"]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        It splits a Spark application into tasks and schedules them to run on executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A driver is where the task scheduler lives and spawns tasks across workers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A driver coordinates workers and overall execution of tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: spark-shell.md[Spark shell] is a Spark application and the driver. It creates a SparkContext that is available as sc.

The driver requires additional services (besides the common ones like shuffle:ShuffleManager.md[], memory:MemoryManager.md[], storage:BlockTransferService.md[], BroadcastManager):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Listener Bus
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • rpc:index.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • scheduler:MapOutputTrackerMaster.md[] with the name MapOutputTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • storage:BlockManagerMaster.md[] with the name BlockManagerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MetricsSystem with the name driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • OutputCommitCoordinator

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME Diagram of RpcEnv for a driver (and later executors). Perhaps it should be in the notes about RpcEnv?

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • High-level control flow of work
• Your Spark application runs as long as the Spark driver; once the driver terminates, so does your Spark application.
• Creates SparkContext, RDDs, and executes transformations and actions
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Launches scheduler:Task.md[tasks]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[driver-memory]] Driver's Memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                        It can be set first using spark-submit/index.md#command-line-options[spark-submit's --driver-memory] command-line option or <> and falls back to spark-submit/index.md#environment-variables[SPARK_DRIVER_MEMORY] if not set earlier.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"driver/#driver-cores","title":"Driver Cores

                                                                                                                                                                                                                                                                                                                                                                                                                                                        It can be set first using spark-submit/index.md#driver-cores[spark-submit's --driver-cores] command-line option for cluster deploy mode.

NOTE: In client deploy mode the driver's cores correspond to the number of CPU cores of the JVM process the Spark application runs on.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[settings]] Settings

.Spark Properties
[cols=\"1,1,2\",options=\"header\",width=\"100%\"]
|===
| Spark Property | Default Value | Description

| [[spark_driver_blockManager_port]] spark.driver.blockManager.port
| storage:BlockManager.md#spark_blockManager_port[spark.blockManager.port]
| Port to use for the storage:BlockManager.md[BlockManager] on the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        More precisely, spark.driver.blockManager.port is used when core:SparkEnv.md#NettyBlockTransferService[NettyBlockTransferService is created] (while SparkEnv is created for the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark_driver_memory]] spark.driver.memory | 1g | The driver's memory size (in MiBs).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark_driver_cores]] spark.driver.cores | 1 | The number of CPU cores assigned to the driver in cluster deploy mode.

NOTE: When yarn/spark-yarn-client.md#creating-instance[Client is created] (for Spark on YARN in cluster mode only), it sets the number of cores for the ApplicationMaster using spark.driver.cores.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark_driver_extraLibraryPath]] spark.driver.extraLibraryPath | |

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark_driver_extraJavaOptions]] spark.driver.extraJavaOptions | | Additional JVM options for the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark.driver.appUIAddress]] spark.driver.appUIAddress

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.driver.appUIAddress is used exclusively in yarn/README.md[Spark on YARN]. It is set when yarn/spark-yarn-client-yarnclientschedulerbackend.md#start[YarnClientSchedulerBackend starts] to yarn/spark-yarn-applicationmaster.md#runExecutorLauncher[run ExecutorLauncher] (and yarn/spark-yarn-applicationmaster.md#registerAM[register ApplicationMaster] for the Spark application).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [[spark_driver_libraryPath]] spark.driver.libraryPath | |

                                                                                                                                                                                                                                                                                                                                                                                                                                                        |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"driver/#sparkdriverextraclasspath","title":"spark.driver.extraClassPath

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.driver.extraClassPath system property sets the additional classpath entries (e.g. jars and directories) that should be added to the driver's classpath in cluster deploy mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"driver/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        For client deploy mode you can use a properties file or command line to set spark.driver.extraClassPath.

Do not use SparkConf.md[SparkConf] since it is too late in client deploy mode: the JVM has already been set up to start the Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"driver/#refer-to-spark-classmdbuildsparksubmitcommandbuildsparksubmitcommand-internal-method-for-the-very-low-level-details-of-how-it-is-handled-internally","title":"Refer to spark-class.md#buildSparkSubmitCommand[buildSparkSubmitCommand Internal Method] for the very low-level details of how it is handled internally.","text":"

spark.driver.extraClassPath uses an OS-specific path separator.

NOTE: Use spark-submit's spark-submit/index.md#driver-class-path[--driver-class-path command-line option] to override spark.driver.extraClassPath set in a spark-properties.md#spark-defaults-conf[Spark properties file].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"local-properties/","title":"Local Properties","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SparkContext.setLocalProperty lets you set key-value pairs that will be propagated down to tasks and can be accessed there using TaskContext.getLocalProperty.
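
Here is a minimal sketch, assuming a running SparkContext available as sc (e.g. in spark-shell); the property key job.description is an arbitrary example:

import org.apache.spark.TaskContext

sc.setLocalProperty("job.description", "nightly-report")

val seen = sc.parallelize(1 to 2).map { _ =>
  TaskContext.get().getLocalProperty("job.description")  // read inside the task
}.collect()
// seen: Array(nightly-report, nightly-report)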

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"local-properties/#creating-logical-job-groups","title":"Creating Logical Job Groups","text":"

One of the purposes of local properties is to create logical groups of Spark jobs by means of properties that, regardless of the threads used to submit the jobs, make the separate jobs launched from different threads belong to a single logical group.

A common use case for local properties is to set one in a thread, e.g. spark-scheduler-FairSchedulableBuilder.md[spark.scheduler.pool], so that all jobs submitted from that thread are grouped, e.g. into a pool by the FAIR job scheduler, as the example below shows.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        val data = sc.parallelize(0 to 9)\n\nsc.setLocalProperty(\"spark.scheduler.pool\", \"myPool\")\n\n// these two jobs (one per action) will run in the myPool pool\ndata.count\ndata.collect\n\nsc.setLocalProperty(\"spark.scheduler.pool\", null)\n\n// this job will run in the default pool\ndata.count\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"master/","title":"Master","text":"


                                                                                                                                                                                                                                                                                                                                                                                                                                                        A master is a running Spark instance that connects to a cluster manager for resources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The master acquires cluster nodes to run executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME Add it to the Spark architecture figure above.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"overview/","title":"Spark Core","text":"

Apache Spark is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine. It can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (streaming processing), with rich, concise, high-level APIs for the programming languages Scala, Python, Java, R, and SQL.

You could also describe Spark as a distributed data processing engine for batch and streaming modes featuring SQL queries, graph processing, and machine learning.

In contrast to Hadoop's two-stage disk-based MapReduce computation engine, Spark's multi-stage (mostly) in-memory computing engine allows running most computations in memory, and hence most of the time provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (read Spark officially sets a new record in large-scale sorting).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark aims at speed, ease of use, extensibility and interactive analytics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms, and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Dataset.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Using Spark Application Frameworks, Spark simplifies access to machine learning and predictive analytics at scale.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark is mainly written in http://scala-lang.org/[Scala], but provides developer API for languages like Java, Python, and R.

If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is a viable alternative.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Access any data type across any data source.
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Huge demand for storage and data processing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The Apache Spark project is an umbrella for https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] (with Datasets), https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[streaming], http://spark.apache.org/mllib/[machine learning] (pipelines) and http://spark.apache.org/graphx/[graph] processing engines built on top of the Spark Core. You can run them all in a single application using a consistent API.

Spark runs locally as well as in clusters, on-premises or in the cloud. It runs on top of Hadoop YARN or Apache Mesos, standalone, or in the cloud (Amazon EC2 or IBM Bluemix).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Apache Spark's https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[Structured Streaming] and https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.

At a high level, any Spark application creates RDDs out of some input, runs rdd:index.md[(lazy) transformations] of these RDDs into some other form (shape), and finally performs rdd:index.md[actions] to collect or store data. Not much, huh?
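
For example, a minimal sketch of that create-transform-act cycle, assuming a running SparkContext available as sc (e.g. in spark-shell) and a placeholder input file README.md:

val lines   = sc.textFile("README.md")        // create an RDD out of some input
val words   = lines.flatMap(_.split("\\s+"))  // (lazy) transformation
val lengths = words.map(_.length)             // another (lazy) transformation
val total   = lengths.reduce(_ + _)           // action that finally runs the job
println(total)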

You can look at Spark from a programmer's, a data engineer's, and an administrator's point of view. To be honest, all three will spend quite a lot of time with Spark before they exploit all the available features. Programmers use language-specific APIs (and work at the level of RDDs using transformations and actions), data engineers use higher-level abstractions like the DataFrames or Pipelines APIs or external tools (that connect to Spark), and administrators make it all possible by setting up Spark clusters to deploy Spark applications to.

It is Spark's goal to be a general-purpose computing platform with various specialized application frameworks on top of a single unified engine.

NOTE: When you hear \"Apache Spark\", it can mean one of two things: the Spark engine, aka Spark Core, or the Apache Spark open-source project, an \"umbrella\" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX], that sit on top of Spark Core and its main data abstraction, rdd:index.md[RDD - Resilient Distributed Dataset].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"overview/#why-spark","title":"Why Spark","text":"

Let's list a few of the many reasons for Spark. We do this first, and then comes the overview that lends a more technical helping hand.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"overview/#easy-to-get-started","title":"Easy to Get Started","text":"

Spark offers spark-shell, which makes for a very easy head start to writing and running Spark applications on the command line on your laptop.

You could then use the built-in Spark Standalone cluster manager to deploy your Spark applications to a production-grade cluster and run them on the full dataset.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"overview/#unified-engine-for-diverse-workloads","title":"Unified Engine for Diverse Workloads","text":"

As Matei Zaharia, the author of Apache Spark, said in the Introduction to AmpLab Spark Internals video (quoting with a few changes):

One of the Spark project's goals was to deliver a platform that supports a very wide array of diverse workflows - not only MapReduce batch jobs (which were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        And also different scales of workloads from sub-second interactive jobs to jobs that run for many hours.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark combines batch, interactive, and streaming workloads under one rich concise API.

Spark supports near real-time streaming workloads via the spark-streaming/spark-streaming.md[Spark Streaming] application framework.

ETL workloads and analytics workloads are different; however, Spark attempts to offer a unified platform for a wide variety of workloads.

Graph and Machine Learning algorithms are iterative by nature, and fewer saves to disk or transfers over the network mean better performance.

There is also support for interactive workloads using the Spark shell.

You should watch the video https://youtu.be/SxAxAhn-BDU[What is Apache Spark?] by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides an excellent overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Leverages the Best in distributed batch data processing

                                                                                                                                                                                                                                                                                                                                                                                                                                                        When you think about distributed batch data processing, varia/spark-hadoop.md[Hadoop] naturally comes to mind as a viable solution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on YARN and HDFS - while improving on the performance and simplicity of the distributed computing engine.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        For many, Spark is Hadoop++, i.e. MapReduce done in a better way.

And it should not come as a surprise: without Hadoop MapReduce (its advances and deficiencies), Spark would not have been born at all.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === RDD - Distributed Parallel Scala Collections

                                                                                                                                                                                                                                                                                                                                                                                                                                                        As a Scala developer, you may find Spark's RDD API very similar (if not identical) to http://www.scala-lang.org/docu/files/collections-api/collections.html[Scala's Collections API].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).

So, when you need a distributed Collections API in Scala, Spark with its RDD API should be a serious contender.
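
To illustrate the resemblance, a minimal sketch assuming a running SparkContext available as sc:

// Scala Collections API - local, single JVM
val local = (1 to 5).map(n => n * n).filter(_ % 2 == 1)

// Spark RDD API - the same operators, executed across a cluster
val distributed = sc.parallelize(1 to 5).map(n => n * n).filter(_ % 2 == 1).collect()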

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[rich-standard-library]] Rich Standard Library

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Not only can you use map and reduce (as in Hadoop MapReduce jobs) in Spark, but also a vast array of other higher-level operators to ease your Spark queries and application development.

It expands the available computation styles beyond the sole map-and-reduce model available in Hadoop MapReduce.
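
For example, a classic word count built from such higher-level operators, as a sketch assuming a running SparkContext available as sc and a placeholder input file README.md:

val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // a higher-level operator beyond plain map and reduce
counts.take(10).foreach(println)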

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Unified development and deployment environment for all

Regardless of the Spark tools you use - the Spark API for the many supported programming languages (Scala, Java, Python, R), spark-shell.md[the Spark shell], or the many Spark Application Frameworks leveraging the concept of rdd:index.md[RDD], i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX] - you still use the same development and deployment environment to process large data sets and yield a result, be it a prediction (spark-mllib/spark-mllib.md[Spark MLlib]), a structured data query (Spark SQL), or just a large distributed batch (Spark Core) or streaming (Spark Streaming) computation.

Spark is also very productive in that teams can exploit the different skills their members have acquired so far. Data analysts, data scientists, and Python, Java, Scala, or R programmers can all use the same Spark platform through a tailor-made API. It brings skilled people with expertise in different programming languages together on a Spark project.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Interactive Exploration / Exploratory Analytics

This is also called ad hoc querying.

Using spark-shell.md[the Spark shell] you can execute computations to process large amounts of data (The Big Data). It's all interactive and very useful for exploring the data before a final production release.

Also, using the Spark shell you can access any spark-cluster.md[Spark cluster] as if it were your local machine. Just point the Spark shell to a 20-node cluster with 10TB of RAM in total (using --master) and use all the components (and their abstractions) like Spark SQL, Spark MLlib, spark-streaming/spark-streaming.md[Spark Streaming], and Spark GraphX.

Depending on your needs and skills, you may find SQL or the programming APIs a better fit, or apply machine learning algorithms (Spark MLlib) to data held in graph data structures (Spark GraphX).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Single Environment

Regardless of which programming language you are good at, be it Scala, Java, Python, R or SQL, you can use the same single clustered runtime environment for prototyping, ad hoc queries, and deploying your applications, leveraging the many data ingestion points offered by the Spark platform.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark SQL (Datasets), Spark MLlib (ML Pipelines), Spark GraphX (Graphs) or spark-streaming/spark-streaming.md[Spark Streaming] (DStreams).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Or use them all in a single application.

A single programming model and execution engine for different kinds of workloads simplifies development and deployment architectures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Data Integration Toolkit with Rich Set of Supported Data Sources

Spark can read from many types of data sources -- relational, NoSQL, file systems, etc. -- in many data formats: Parquet, Avro, CSV, and JSON.

Both input and output data sources allow programmers and data engineers to use Spark as the platform where large amounts of data are read from or saved to for processing, interactively (using the Spark shell) or in applications.
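
As a sketch using the Spark SQL DataFrame reader and writer, assuming a SparkSession available as spark and placeholder file paths:

val people = spark.read
  .option("header", "true")
  .csv("people.csv")                    // read a CSV data source

people.write.parquet("people.parquet")  // save the same data in Parquet format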

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Tools unavailable then, at your fingertips now

As much and as often as it's recommended http://c2.com/cgi/wiki?PickTheRightToolForTheJob[to pick the right tool for the job], it's not always feasible. Time, personal preference, and the operating system you work on are all factors in deciding what is right at the time (and using a hammer can be a reasonable choice).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark embraces many concepts in a single unified development and runtime environment.

• Machine learning, so tool- and feature-rich in Python (e.g. the scikit-learn library), can now be used by Scala developers (via the Pipeline API in Spark MLlib or by calling pipe()).
• DataFrames, familiar from R, are available in the Scala, Java, Python, and R APIs.
• Single-node computations in machine learning algorithms are migrated to their distributed versions in Spark MLlib.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        This single platform gives plenty of opportunities for Python, Scala, Java, and R programmers as well as data engineers (SparkR) and scientists (using proprietary enterprise data warehouses with spark-sql-thrift-server.md[Thrift JDBC/ODBC Server] in Spark SQL).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Mind the proverb https://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail[if all you have is a hammer, everything looks like a nail], too.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Low-level Optimizations

Apache Spark uses a scheduler:DAGScheduler.md[directed acyclic graph (DAG) of computation stages] (aka execution DAG). It postpones any processing until it is really required by actions. Spark's lazy evaluation gives plenty of opportunities to induce low-level optimizations (so users have to know less to do more).
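
A quick way to observe that laziness, as a sketch assuming a running SparkContext available as sc:

val numbers = sc.parallelize(1 to 1000000)
val doubled = numbers.map(_ * 2)          // nothing is executed yet - only the DAG grows
val fours   = doubled.filter(_ % 4 == 0)  // still nothing is executed
val count   = fours.count()               // the action triggers a single job over the whole lineage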

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Mind the proverb https://en.wiktionary.org/wiki/less_is_more[less is more].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Excels at low-latency iterative workloads

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark supports diverse workloads, but successfully targets low-latency iterative ones. They are often used in Machine Learning and graph algorithms.

Many Machine Learning algorithms, like logistic regression, require plenty of iterations before the resulting models become optimal. The same applies to graph algorithms that traverse all the nodes and edges when needed. Such computations perform better when the interim partial results are stored in memory or on very fast solid state drives.

Spark can spark-rdd-caching.md[cache intermediate data in memory for faster model building and training]. Once the data is loaded into memory (as an initial step), reusing it multiple times incurs no performance slowdown.
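
For example, a sketch assuming a running SparkContext available as sc and a placeholder input file README.md:

val words = sc.textFile("README.md").flatMap(_.split("\\s+")).cache()
val total    = words.count()             // the first action materializes the RDD in memory
val distinct = words.distinct().count()  // later actions reuse the cached data - no re-reading from disk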

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Also, graph algorithms can traverse graphs one connection per iteration with the partial result in memory.

Less disk access and less network traffic can make a huge difference when you need to process lots of data, especially when it is Big Data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === ETL done easier

Spark gives Extract, Transform and Load (ETL) a new look with its support for many programming languages: Scala, Java, Python, and (to a lesser extent) R. You can use them all or pick the best one for a given problem.

Scala in Spark, especially, makes for much less boilerplate code (compared to other languages and approaches like MapReduce in Java).
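
As a rough illustration of the difference (a sketch for spark-shell; the input path is hypothetical), the classic word count fits in a few lines of Scala, while the equivalent Java MapReduce job typically needs separate mapper, reducer and driver classes:

val words = sc.textFile(\"data/words.txt\").flatMap(_.split(\" \"))  // split every line into words\nval counts = words.map(word => (word, 1)).reduceByKey(_ + _)      // sum the occurrences of each word\ncounts.take(10).foreach(println)\n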

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[unified-api]] Unified Concise High-Level API

Spark offers unified, concise, high-level APIs for batch analytics (RDD API), SQL queries (Dataset API), real-time analysis (DStream API), machine learning (ML Pipeline API) and graph processing (Graph API).

Developers no longer have to learn many different processing engines and platforms; instead they can spend their time mastering the framework APIs for each use case, all atop a single computation engine (Spark).
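
As a small sketch (assuming a spark-shell session where both sc and spark are in scope), the same engine serves the low-level RDD API and the Dataset API alike:

val doubledRdd = sc.parallelize(1 to 5).map(_ * 2)              // batch processing with the RDD API\nval doubledDs = spark.range(5).selectExpr(\"id * 2 AS doubled\")  // the Dataset API on the same engine\ndoubledRdd.collect()\ndoubledDs.show()\n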

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Different kinds of data processing using unified API

Spark offers three kinds of data processing: batch, interactive, and stream processing, all with a unified API and data structures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Little to no disk use for better performance

In the not-so-distant past, when the most prevalent distributed computing framework was varia/spark-hadoop.md[Hadoop MapReduce], you could reuse data between computations (even partial ones!) only after writing it to an external storage like varia/spark-hadoop.md[Hadoop Distributed File System (HDFS)]. Even very basic multi-stage computations could take a long time, as they simply suffered from IO (and perhaps network) overhead.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        One of the many motivations to build Spark was to have a framework that is good at data reuse.

Spark avoids this by keeping as much data as possible in memory and keeping it there until a job is finished. It doesn't matter how many stages belong to a job. What does matter is the available memory and how effectively you use the Spark API (so that rdd:index.md[no shuffles occur]).

The less network and disk IO, the better the performance, and Spark tries hard to minimize both.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Fault Tolerance included

Faults are not considered a special case in Spark, but an obvious consequence of being a parallel and distributed system. Spark handles and recovers from faults by default, without requiring particularly complex logic to deal with them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Small Codebase Invites Contributors

Spark's design is fairly simple and the resulting codebase is not huge compared to the features it offers.

The reasonably small codebase of Spark invites project contributors: programmers who extend the platform and fix bugs at a steady pace.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[i-want-more]] Further reading or watching

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • (video) https://youtu.be/L029ZNBG7bk[Keynote: Spark 2.0 - Matei Zaharia, Apache Spark Creator and CTO of Databricks]
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"push-based-shuffle/","title":"Push-Based Shuffle","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Push-Based Shuffle is a new feature of Apache Spark 3.2.0 (cf. SPARK-30602) to improve shuffle efficiency.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Push-based shuffle is enabled using spark.shuffle.push.enabled configuration property and can only be used in a Spark application submitted to YARN cluster manager, with external shuffle service enabled, IO encryption disabled, and relocation of serialized objects supported.
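
A minimal sketch of such a configuration follows; the property names are the documented ones, while the application name is illustrative, and the application would still have to be submitted to YARN (e.g. with --master yarn):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .setAppName(\"push-based-shuffle-demo\")        // illustrative application name\n  .set(\"spark.shuffle.push.enabled\", \"true\")    // turn push-based shuffle on\n  .set(\"spark.shuffle.service.enabled\", \"true\") // external shuffle service is required\n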

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-debugging/","title":"Debugging Spark","text":""},{"location":"spark-debugging/#using-spark-shell-and-intellij-idea","title":"Using spark-shell and IntelliJ IDEA","text":"

Start spark-shell with the SPARK_SUBMIT_OPTS environment variable that configures the JVM's JDWP agent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SPARK_SUBMIT_OPTS=\"-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005\" ./bin/spark-shell\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Attach IntelliJ IDEA to the JVM process using Run > Attach to Local Process menu.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-debugging/#using-sbt","title":"Using sbt","text":"

Use sbt -jvm-debug 5005, connect to the remote JVM at port 5005 using IntelliJ IDEA, and place breakpoints on the desired lines of the Spark source code.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ sbt -jvm-debug 5005\nListening for transport dt_socket at address: 5005\n...\n

Create a SparkContext and the breakpoints get triggered.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> val sc = new SparkContext(conf)\n15/11/14 22:58:46 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Tip

Read the Debugging chapter in IntelliJ IDEA's Help.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-logging/","title":"Spark Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Apache Spark uses Apache Log4j 2 for logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-logging/#conflog4j2properties","title":"conf/log4j2.properties","text":"

The default logging configuration for Spark applications is in conf/log4j2.properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Use conf/log4j2.properties.template as a starting point.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-logging/#logging-levels","title":"Logging Levels

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The valid logging levels are log4j's Levels (from most specific to least):

[options=\"header\",width=\"100%\"]
|===
| Name | Description
| OFF | No events will be logged
| FATAL | A fatal event that will prevent the application from continuing
| ERROR | An error in the application, possibly recoverable
| WARN | An event that might possibly lead to an error
| INFO | An event for informational purposes
| DEBUG | A general debugging event
| TRACE | A fine-grained debug message, typically capturing the flow through the application
| ALL | All events should be logged
|===

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The names of the logging levels are case-insensitive.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-logging/#turn-logging-off","title":"Turn Logging Off

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The following sample conf/log4j2.properties turns all logging of Apache Spark (and Apache Hadoop) off.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        # Set to debug or trace if log4j initialization fails\nstatus = warn\n\n# Name of the configuration\nname = exploring-internals\n\n# Console appender configuration\nappender.console.type = Console\nappender.console.name = consoleLogger\nappender.console.layout.type = PatternLayout\nappender.console.layout.pattern = %d{YYYY-MM-dd HH:mm:ss} [%t] %-5p %c:%L - %m%n\nappender.console.target = SYSTEM_OUT\n\nrootLogger.level = off\nrootLogger.appenderRef.stdout.ref = consoleLogger\n\nlogger.spark.name = org.apache.spark\nlogger.spark.level = off\n\nlogger.hadoop.name = org.apache.hadoop\nlogger.hadoop.level = off\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-logging/#setting-default-log-level-programatically","title":"Setting Default Log Level Programatically

Setting Default Log Level Programmatically
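
One way to do it is SparkContext.setLogLevel, e.g. in spark-shell (the level below is just an example):

sc.setLogLevel(\"WARN\")  // valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN\n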

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-logging/#setting-log-levels-in-spark-applications","title":"Setting Log Levels in Spark Applications

In standalone Spark applications or in a Spark shell session, use the following:

import org.apache.hadoop.yarn.util.RackResolver  // the class whose logger is inspected below\nimport org.apache.log4j.{Level, Logger}\n\nLogger.getLogger(classOf[RackResolver]).getLevel  // read the current level of a specific logger\nLogger.getLogger(\"org\").setLevel(Level.OFF)       // silence every logger under the org package\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-properties/","title":"Spark Properties and spark-defaults.conf Properties File","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark properties are the means of tuning the execution environment of a Spark application.

The default Spark properties file is <<spark-defaults-conf, $SPARK_HOME/conf/spark-defaults.conf>>, which can be overridden using spark-submit with the spark-submit/index.md#properties-file[--properties-file] command-line option.

.Environment Variables
[options=\"header\",width=\"100%\"]
|===
| Environment Variable | Default Value | Description
| SPARK_CONF_DIR | $\\{SPARK_HOME}/conf | Spark's configuration directory (with spark-defaults.conf)
|===

                                                                                                                                                                                                                                                                                                                                                                                                                                                        TIP: Read the official documentation of Apache Spark on http://spark.apache.org/docs/latest/configuration.html[Spark Configuration].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[spark-defaults-conf]] spark-defaults.conf -- Default Spark Properties File

                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark-defaults.conf (under SPARK_CONF_DIR or $SPARK_HOME/conf) is the default properties file with the Spark properties of your Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: spark-defaults.conf is loaded by spark-AbstractCommandBuilder.md#loadPropertiesFile[AbstractCommandBuilder's loadPropertiesFile internal method].
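
As a quick sanity check from spark-shell (a sketch; the property looked up is just an example), you can inspect what the application's SparkConf ended up with after spark-defaults.conf was loaded:

sc.getConf.get(\"spark.master\")             // look up a single property\nsc.getConf.getAll.sorted.foreach(println)  // print every (key, value) pair the application sees\n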

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[getDefaultPropertiesFile]] Calculating Path of Default Spark Properties -- Utils.getDefaultPropertiesFile method

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-properties/#source-scala","title":"[source, scala]","text":""},{"location":"spark-properties/#getdefaultpropertiesfileenv-mapstring-string-sysenv-string","title":"getDefaultPropertiesFile(env: Map[String, String] = sys.env): String","text":"

getDefaultPropertiesFile calculates the absolute path to the spark-defaults.conf properties file, which can be either in the directory specified by the SPARK_CONF_DIR environment variable or in the $SPARK_HOME/conf directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: getDefaultPropertiesFile is part of private[spark] org.apache.spark.util.Utils object.
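
A sketch of that lookup with the same order of precedence (not the verbatim Spark source; the helper name is made up):

import java.io.File\n\ndef defaultPropertiesFile(env: Map[String, String] = sys.env): String =\n  env.get(\"SPARK_CONF_DIR\")                                                   // SPARK_CONF_DIR wins when set\n    .orElse(env.get(\"SPARK_HOME\").map(home => s\"$home${File.separator}conf\")) // otherwise $SPARK_HOME/conf\n    .map(dir => new File(dir, \"spark-defaults.conf\"))                         // the properties file in that directory\n    .filter(_.isFile)                                                         // only when it actually exists\n    .map(_.getAbsolutePath)\n    .orNull\n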

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-access-private-members-spark-shell/","title":"Access private members in Scala in Spark shell","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Access private members in Scala in Spark shell

If you ever wanted to use private[spark] members in Spark from the Scala programming language, e.g. to toy with org.apache.spark.scheduler.DAGScheduler or similar, you will have to use the following trick in Spark shell: use :paste -raw as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].

Open spark-shell and execute :paste -raw, which allows you to enter any valid Scala code, including package definitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The following snippet shows how to access private[spark] member DAGScheduler.RESUBMIT_TIMEOUT:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> :paste -raw\n// Entering paste mode (ctrl-D to finish)\n\npackage org.apache.spark\n\nobject spark {\n  def test = {\n    import org.apache.spark.scheduler._\n    println(DAGScheduler.RESUBMIT_TIMEOUT == 200)\n  }\n}\n\nscala> spark.test\ntrue\n\nscala> sc.version\nres0: String = 1.6.0-SNAPSHOT\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/","title":"Running Spark Applications on Windows","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Running Spark Applications on Windows

Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: A Spark application could be spark-shell.md[spark-shell] or your own custom Spark application.

What makes the huge difference between the operating systems is Hadoop, which Spark uses internally for file system access.

You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: You do not have to install Apache Hadoop to work with Spark or run Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        TIP: Read the Apache Hadoop project's https://wiki.apache.org/hadoop/WindowsProblems[Problems running Hadoop on Windows].

Among the issues is the infamous java.io.IOException when running Spark Shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path\njava.io.IOException: Could not locate executable null\\bin\\winutils.exe in the Hadoop binaries.\n  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)\n  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)\n  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)\n  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)\n  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)\n  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)\n  at java.lang.Class.forName0(Native Method)\n  at java.lang.Class.forName(Class.java:348)\n  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)\n  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)\n  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/#note","title":"[NOTE]","text":"

You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window (cmd) run as Administrator, i.e. using the Run as administrator option while starting cmd.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/#read-the-official-document-in-microsoft-technet-httpstechnetmicrosoftcomen-uslibrarycc947813vws10aspxstart-a-command-prompt-as-an-administrator","title":"Read the official document in Microsoft TechNet -- ++https://technet.microsoft.com/en-us/library/cc947813(v=ws.10).aspx++[Start a Command Prompt as an Administrator].","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Download winutils.exe binary from https://github.com/steveloughran/winutils repository.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2 (https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe[here is the direct link to winutils.exe binary]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Save winutils.exe binary to a directory of your choice, e.g. c:\\hadoop\\bin.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Set HADOOP_HOME to reflect the directory with winutils.exe (without bin).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        set HADOOP_HOME=c:\\hadoop\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Set PATH environment variable to include %HADOOP_HOME%\\bin as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        set PATH=%HADOOP_HOME%\\bin;%PATH%\n

TIP: Define the HADOOP_HOME and PATH environment variables in Control Panel so that any Windows program can use them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Create C:\\tmp\\hive directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/#note_1","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        c:\\tmp\\hive directory is the default value of https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir[hive.exec.scratchdir configuration property] in Hive 0.14.0 and later and Spark uses a custom build of Hive 1.2.1.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/#you-can-change-hiveexecscratchdir-configuration-property-to-another-directory-as-described-in-wzxhzdk27-configuration-property-in-this-document","title":"You can change hive.exec.scratchdir configuration property to another directory as described in <hive.exec.scratchdir Configuration Property>> in this document.

Execute the following command in the cmd window that you started using the Run as administrator option.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        winutils.exe chmod -R 777 C:\\tmp\\hive\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Check the permissions (that is one of the commands that are executed under the covers):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        winutils.exe ls -F C:\\tmp\\hive\n

Open spark-shell and observe the output (perhaps with a few WARN messages that you can simply disregard).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        As a verification step, execute the following line to display the content of a DataFrame:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-tips-and-tricks-running-spark-windows/#source-scala","title":"[source, scala]","text":"

scala> spark.range(1).withColumn(\"status\", lit(\"All seems fine. Congratulations!\")).show(false)\n+---+--------------------------------+\n|id |status                          |\n+---+--------------------------------+\n|0  |All seems fine. Congratulations!|\n+---+--------------------------------+\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks-running-spark-windows/#note_2","title":"[NOTE]

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Disregard WARN messages when you start spark-shell. They are harmless.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-tips-and-tricks-running-spark-windows/#161226-220541-warn-general-plugin-bundle-orgdatanucleus-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27jarsdatanucleus-core-3210jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27binjarsdatanucleus-core-3210jar-161226-220541-warn-general-plugin-bundle-orgdatanucleusapijdo-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27jarsdatanucleus-api-jdo-326jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27binjarsdatanucleus-api-jdo-326jar-161226-220541-warn-general-plugin-bundle-orgdatanucleusstorerdbms-is-already-registered-ensure-you-dont-have-multiple-jar-versions-of-the-same-plugin-in-the-classpath-the-url-filecspark-202-bin-hadoop27binjarsdatanucleus-rdbms-329jar-is-already-registered-and-you-are-trying-to-register-an-identical-plugin-located-at-url-filecspark-202-bin-hadoop27jarsdatanucleus-rdbms-329jar","title":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                        16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus\" is already registered. Ensure you dont have multiple JAR versions of\nthe same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar\" is already registered,\nand you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-\n3.2.10.jar.\"\n16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus.api.jdo\" is already registered. Ensure you dont have multiple JAR\nversions of the same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar\" is already\nregistered, and you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-\nhadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar.\"\n16/12/26 22:05:41 WARN General: Plugin (Bundle) \"org.datanucleus.store.rdbms\" is already registered. Ensure you dont have multiple JAR\nversions of the same plugin in the classpath. The URL \"file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar\" is\nalready registered, and you are trying to register an identical plugin located at URL \"file:/C:/spark-2.0.2-bin-\nhadoop2.7/jars/datanucleus-rdbms-3.2.9.jar.\"\n

If you see the above output, you're done. You should now be able to run Spark applications on your Windows machine. Congrats!

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[changing-hive.exec.scratchdir]] Changing hive.exec.scratchdir Configuration Property

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Create a hive-site.xml file with the following content:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        <configuration>\n  <property>\n    <name>hive.exec.scratchdir</name>\n    <value>/tmp/mydir</value>\n    <description>Scratch space for Hive jobs</description>\n  </property>\n</configuration>\n

Start a Spark application, e.g. spark-shell, with the HADOOP_CONF_DIR environment variable set to the directory containing hive-site.xml.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        HADOOP_CONF_DIR=conf ./bin/spark-shell\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"spark-tips-and-tricks-sparkexception-task-not-serializable/","title":"Task not serializable Exception","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == org.apache.spark.SparkException: Task not serializable

When you run into the org.apache.spark.SparkException: Task not serializable exception, it means that you are using a reference to an instance of a non-serializable class inside a transformation. See the following example (a possible fix is sketched after the stack trace):

                                                                                                                                                                                                                                                                                                                                                                                                                                                        \u279c  spark git:(master) \u2717 ./bin/spark-shell\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 1.6.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala> class NotSerializable(val num: Int)\ndefined class NotSerializable\n\nscala> val notSerializable = new NotSerializable(10)\nnotSerializable: NotSerializable = NotSerializable@2700f556\n\nscala> sc.parallelize(0 to 10).map(_ => notSerializable.num).count\norg.apache.spark.SparkException: Task not serializable\n  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)\n  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)\n  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)\n  at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)\n  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)\n  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)\n  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)\n  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)\n  at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)\n  at org.apache.spark.rdd.RDD.map(RDD.scala:317)\n  ... 
48 elided\nCaused by: java.io.NotSerializableException: NotSerializable\nSerialization stack:\n    - object not serializable (class: NotSerializable, value: NotSerializable@2700f556)\n    - field (class: $iw, name: notSerializable, type: class NotSerializable)\n    - object (class $iw, $iw@10e542f3)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@729feae8)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5fc3b20b)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@36dab184)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5eb974)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@79c514e4)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@5aeaee3)\n    - field (class: $iw, name: $iw, type: class $iw)\n    - object (class $iw, $iw@2be9425f)\n    - field (class: $line18.$read, name: $iw, type: class $iw)\n    - object (class $line18.$read, $line18.$read@6311640d)\n    - field (class: $iw, name: $line18$read, type: class $line18.$read)\n    - object (class $iw, $iw@c9cd06e)\n    - field (class: $iw, name: $outer, type: class $iw)\n    - object (class $iw, $iw@6565691a)\n    - field (class: $anonfun$1, name: $outer, type: class $iw)\n    - object (class $anonfun$1, <function1>)\n  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)\n  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)\n  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)\n  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)\n  ... 57 more\n
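
A common way out of it (a minimal sketch, not the only possible fix) is to avoid capturing the non-serializable instance in the closure in application code: either copy just the value you need into a local val, or make the class Serializable in the first place.

// the same NotSerializable class as in the example above\nclass NotSerializable(val num: Int)\nval notSerializable = new NotSerializable(10)\n\n// Option 1: capture only the value, not the non-serializable instance\nval num = notSerializable.num\nsc.parallelize(0 to 10).map(_ => num).count\n\n// Option 2: make the class serializable so its instances can be shipped to executors\nclass NowSerializable(val num: Int) extends Serializable\nval nowSerializable = new NowSerializable(10)\nsc.parallelize(0 to 10).map(_ => nowSerializable.num).count\n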

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Further reading

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks/","title":"Spark Tips and Tricks","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        = Spark Tips and Tricks

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts

The SPARK_PRINT_LAUNCH_COMMAND environment variable controls whether the Spark launch command is printed out to the standard error output (System.err).

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark Command: [here comes the command]\n========================================\n

All the Spark shell scripts use the org.apache.spark.launcher.Main class internally, which checks SPARK_PRINT_LAUNCH_COMMAND and, when it is set (to any value), prints out the entire command line to launch it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell\nSpark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell\n========================================\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Show Spark version in Spark shell

In spark-shell, use sc.version or org.apache.spark.SPARK_VERSION to check the version of Spark:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> sc.version\nres0: String = 1.6.0-SNAPSHOT\n\nscala> org.apache.spark.SPARK_VERSION\nres1: String = 1.6.0-SNAPSHOT\n
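
Outside the shell, spark-submit --version prints the version as well (a quick alternative):

./bin/spark-submit --version\n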

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Resolving local host name

When you face networking issues and Spark cannot resolve your local hostname or IP address, set the preferred SPARK_LOCAL_HOSTNAME environment variable to a custom hostname, or SPARK_LOCAL_IP to a custom IP address that is later resolved to a hostname.

Spark checks them before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult the https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).
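
For example (the values below are placeholders), set either variable when starting spark-shell:

SPARK_LOCAL_HOSTNAME=localhost ./bin/spark-shell\nSPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell\n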

You may see the following WARN messages in the logs when Spark has finished the resolution process:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Your hostname, [hostname] resolves to a loopback address: [host-address]; using...\nSet SPARK_LOCAL_IP if you need to bind to another address\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"spark-tips-and-tricks/#starting-standalone-master-and-workers-on-windows-7","title":"Starting standalone Master and workers on Windows 7","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Windows 7 users can use spark-class to start Spark Standalone as there are no launch scripts for the Windows platform.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        ./bin/spark-class org.apache.spark.deploy.master.Master -h localhost\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                        ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077\n
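
To verify the standalone cluster, you can then point spark-shell at the master started above (the URL matches the -h localhost option):

./bin/spark-shell --master spark://localhost:7077\n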
                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"speculative-execution-of-tasks/","title":"Speculative Execution of Tasks","text":"

Speculative tasks (also speculatable tasks or task strugglers) are tasks that run slower than most (FIXME the setting) of all the tasks in a job.

Speculative execution of tasks is a health-check procedure that checks for tasks to be speculated, i.e. tasks running slower in a stage than the median of all successfully completed tasks in a taskset (FIXME the setting). Such slow tasks are re-submitted to another worker. Spark does not stop the slow tasks, but runs a new copy in parallel.

The speculation thread starts when TaskSchedulerImpl starts in spark-cluster.md[clustered deployment modes] with configuration-properties.md#spark.speculation[spark.speculation] enabled. It then executes periodically, every configuration-properties.md#spark.speculation.interval[spark.speculation.interval], after the initial interval passes.
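
For example (the interval value below is only an illustration), you could enable speculative execution when starting spark-shell:

./bin/spark-shell --conf spark.speculation=true --conf spark.speculation.interval=1s\n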

                                                                                                                                                                                                                                                                                                                                                                                                                                                        When enabled, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"speculative-execution-of-tasks/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"speculative-execution-of-tasks/#starting-speculative-execution-thread","title":"Starting speculative execution thread","text":"

It runs as the scheduler:TaskSchedulerImpl.md#task-scheduler-speculation[task-scheduler-speculation daemon thread pool] (a j.u.c.ScheduledThreadPoolExecutor with a core pool size of 1).

The job with speculatable tasks should finish while the speculative copies are still running, and Spark leaves these copies running; there is no KILL command for them yet.

It uses the checkSpeculatableTasks method that asks rootPool to check for speculatable tasks. If there are any, the SchedulerBackend is requested to scheduler:SchedulerBackend.md#reviveOffers[reviveOffers].

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME How does Spark handle repeated results of speculative tasks since there are copies launched?

                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"workers/","title":"Workers","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Workers

Workers (aka slaves) are Spark instances on which executors run to execute tasks. They are the compute nodes in Spark.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME Are workers perhaps part of Spark Standalone only?

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME How many executors are spawned per worker?

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A worker receives serialized tasks that it runs in a thread pool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        It hosts a local storage:BlockManager.md[Block Manager] that serves blocks to other workers in a Spark cluster. Workers communicate among themselves using their Block Manager instances.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME Diagram of a driver with workers as boxes.

This section explains task execution in Spark and Spark's underlying execution model.

It also introduces vocabulary that you will often come across in the Spark UI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        SparkContext.md[When you create SparkContext], each worker starts an executor. This is a separate process (JVM), and it loads your jar, too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey. When the driver quits, the executors shut down.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        A new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.

In short, a Spark application is executed in three steps:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire computation.
                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
                                                                                                                                                                                                                                                                                                                                                                                                                                                        3. Based on the plan, schedule and execute tasks on workers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        exercises/spark-examples-wordcount-spark-shell.md[In the WordCount example], the RDD graph is as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        file -> lines -> words -> per-word count -> global word count -> output

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Based on this graph, two stages are created. The stage creation rule is based on the idea of pipelining as many rdd:index.md[narrow transformations] as possible. RDD operations with \"narrow\" dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, every stage will only have shuffle dependencies on other stages, and may compute multiple operations inside it.

In the WordCount example, the chain of narrow transformations finishes at per-word count. Therefore, you get two stages:

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • file -> lines -> words -> per-word count
                                                                                                                                                                                                                                                                                                                                                                                                                                                        • global word count -> output
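
A minimal WordCount sketch that matches this breakdown (the input and output paths are hypothetical):

// hypothetical input and output paths\nval file = sc.textFile(\"/tmp/input.txt\")        // file -> lines\nval words = file.flatMap(_.split(\" \"))          // lines -> words (narrow)\nval perWordCount = words.map(word => (word, 1)) // per-word count (narrow, ends stage 1)\nval wordCount = perWordCount.reduceByKey(_ + _) // global word count (shuffle boundary, stage 2)\nwordCount.saveAsTextFile(\"/tmp/output\")         // output (the action that triggers the job)\n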

Once stages are defined, Spark generates scheduler:Task.md[tasks] from scheduler:Stage.md[stages]. The first stage creates scheduler:ShuffleMapTask.md[ShuffleMapTask]s and the last stage creates scheduler:ResultTask.md[ResultTask]s, because the last stage includes an action operation that produces results.

The number of tasks to be generated depends on how your files are distributed. Suppose that you have three different files on three different nodes; the first stage will then generate three tasks: one task per partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is related to a partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        The number of tasks being generated in each stage will be equal to the number of partitions.
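
You can check this in spark-shell (the path below is hypothetical): the number of partitions of an RDD is the number of tasks in the stage that computes it.

val lines = sc.textFile(\"/tmp/input.txt\") // hypothetical path\nlines.getNumPartitions                    // the number of tasks in the stage that computes lines\n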

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[Cleanup]] Cleanup

                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[settings]] Settings

• spark.worker.cleanup.enabled (default: false) controls whether <<Cleanup, Cleanup>> is enabled."},{"location":"accumulators/","title":"Accumulators","text":"

Accumulators are shared variables that accumulate values from executors on the driver using an associative and commutative \"add\" operation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          The main abstraction is AccumulatorV2.

Accumulators are registered (created) using SparkContext, with or without a name. Only named accumulators are displayed in the web UI.

DAGScheduler is responsible for updating accumulators (with partial values from tasks running on executors, every heartbeat).

Accumulators are serializable so they can safely be referenced in the code executed on executors and then safely sent over the wire for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          // on the driver\nval counter = sc.longAccumulator(\"counter\")\n\nsc.parallelize(1 to 9).foreach { x =>\n  // on executors\n  counter.add(x) }\n\n// on the driver\nprintln(counter.value)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"accumulators/#further-reading","title":"Further Reading","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Performance and Scalability of Broadcast in Spark
                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"accumulators/AccumulableInfo/","title":"AccumulableInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                          AccumulableInfo represents an update to an AccumulatorV2.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          AccumulableInfo is used to transfer accumulator updates from executors to the driver every executor heartbeat or when a task finishes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"accumulators/AccumulableInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                          AccumulableInfo takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Accumulator ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Partial Update
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Partial Value
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • internal flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • countFailedValues flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Metadata (default: None)
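
For illustration only, a hypothetical instance for a named, non-internal accumulator could carry the following values (the constructor is private[spark], so user code does not normally create AccumulableInfo directly; the field names follow the list above and may differ slightly across Spark versions):

import org.apache.spark.scheduler.AccumulableInfo

val info = new AccumulableInfo(
  id = 42L,                  // Accumulator ID
  name = Some("counter"),    // Name
  update = Some(10L),        // Partial Update (the delta reported by a single task)
  value = Some(100L),        // Partial Value (the accumulated value so far)
  internal = false,          // internal flag
  countFailedValues = false, // countFailedValues flag
  metadata = None)           // Metadata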

AccumulableInfo is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AccumulatorV2 is requested to convert itself to an AccumulableInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JsonProtocol is requested to accumulableInfoFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLMetric (Spark SQL) is requested to convert itself to an AccumulableInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/AccumulableInfo/#internal-flag","title":"internal Flag
internal: Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                            AccumulableInfo is given an internal flag when created.

The internal flag denotes whether the accumulator is one of Spark's internal accumulators rather than a user-defined one.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            internal is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveEntityHelpers is requested for newAccumulatorInfos
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JsonProtocol is requested to accumulableInfoToJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorContext/","title":"AccumulatorContext","text":"


AccumulatorContext is a private[spark] internal object that Spark uses to register, look up and unregister accumulators, using an internal originals lookup table.

The originals lookup table maps an accumulator identifier to the accumulator itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Every accumulator has its own unique accumulator id that is assigned using the internal nextId counter.
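
Conceptually, the bookkeeping boils down to an id generator plus an id-to-accumulator map. The sketch below is a simplification for illustration only (the actual AccumulatorContext also has to cope with garbage collection of accumulators that are no longer referenced):

import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap
import org.apache.spark.util.AccumulatorV2

// Simplified stand-in for AccumulatorContext (not the actual Spark source)
object MiniAccumulatorContext {
  private val nextId = new AtomicLong(0L)                          // cf. the nextId counter
  private val originals = TrieMap.empty[Long, AccumulatorV2[_, _]] // cf. the originals lookup table

  def newId(): Long = nextId.getAndIncrement()
  def register(id: Long, acc: AccumulatorV2[_, _]): Unit = originals.putIfAbsent(id, acc)
  def get(id: Long): Option[AccumulatorV2[_, _]] = originals.get(id)
  def remove(id: Long): Unit = originals.remove(id)
}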

                                                                                                                                                                                                                                                                                                                                                                                                                                                            === [[register]] register Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                            === [[newId]] newId Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                            === [[AccumulatorContext-SQL_ACCUM_IDENTIFIER]] AccumulatorContext.SQL_ACCUM_IDENTIFIER

AccumulatorContext.SQL_ACCUM_IDENTIFIER is an internal identifier for Spark SQL's internal accumulators. The value is sql and Spark uses it to distinguish Spark SQL metrics (SQLMetric) from others.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/AccumulatorSource/","title":"AccumulatorSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            AccumulatorSource is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/AccumulatorV2/","title":"AccumulatorV2","text":"

AccumulatorV2[IN, OUT] is an abstraction of accumulators that accumulate input values of type IN and produce a result value of type OUT.

AccumulatorV2 is Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/AccumulatorV2/#contract","title":"Contract","text":""},{"location":"accumulators/AccumulatorV2/#adding-value","title":"Adding Value
add(
  v: IN): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Accumulates (adds) the given v value to this accumulator

                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#copying-accumulator","title":"Copying Accumulator
copy(): AccumulatorV2[IN, OUT]
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#is-zero-value","title":"Is Zero Value
isZero: Boolean
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#merging-updates","title":"Merging Updates
merge(
  other: AccumulatorV2[IN, OUT]): Unit
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#resetting-accumulator","title":"Resetting Accumulator
reset(): Unit
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#value","title":"Value
value: OUT

                                                                                                                                                                                                                                                                                                                                                                                                                                                            The current value of this accumulator

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to collectAccumulatorsAndResetStatusOnFailure
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AccumulatorSource is requested to register
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to update accumulators
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskSchedulerImpl is requested to executorHeartbeatReceived
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskSetManager is requested to handleSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JsonProtocol is requested to taskEndReasonFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • others
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AggregatingAccumulator (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CollectionAccumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DoubleAccumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • EventTimeStatsAccum (Spark Structured Streaming)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LongAccumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SetAccumulator (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLMetric (Spark SQL)
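
Besides the built-in implementations, a custom accumulator only needs to implement the contract above. The following is a minimal, hypothetical sketch (a set-of-distinct-strings accumulator; it is not part of Spark):

import org.apache.spark.util.AccumulatorV2

class DistinctStringAccumulator extends AccumulatorV2[String, Set[String]] {
  private var _set = Set.empty[String]

  override def isZero: Boolean = _set.isEmpty
  override def copy(): AccumulatorV2[String, Set[String]] = {
    val acc = new DistinctStringAccumulator
    acc._set = _set
    acc
  }
  override def reset(): Unit = { _set = Set.empty[String] }
  override def add(v: String): Unit = { _set += v }
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { _set ++= other.value }
  override def value: Set[String] = _set
}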
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/AccumulatorV2/#converting-this-accumulator-to-accumulableinfo","title":"Converting this Accumulator to AccumulableInfo
toInfo(
  update: Option[Any],
  value: Option[Any]): AccumulableInfo

toInfo determines whether the accumulator is internal based on its name (whether it starts with the internal.metrics. prefix) and uses that flag to create an AccumulableInfo.
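
The internal decision is essentially a name-prefix test; the following is a sketch of just that check (not the actual Spark source), assuming the internal.metrics. naming convention described for InternalAccumulator:

def isInternalName(name: Option[String]): Boolean =
  name.exists(_.startsWith("internal.metrics."))

assert(isInternalName(Some("internal.metrics.executorRunTime")))
assert(!isInternalName(Some("counter")))
assert(!isInternalName(None))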

toInfo is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to collectAccumulatorsAndResetStatusOnFailure
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to updateAccumulators
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskSchedulerImpl is requested to executorHeartbeatReceived
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JsonProtocol is requested to taskEndReasonFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLAppStatusListener (Spark SQL) is requested to handle a SparkListenerTaskEnd event (onTaskEnd)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#registering-accumulator","title":"Registering Accumulator
register(
  sc: SparkContext,
  name: Option[String] = None,
  countFailedValues: Boolean = false): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                            register...FIXME

register is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to register an accumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskMetrics is requested to register task accumulators
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CollectMetricsExec (Spark SQL) is requested for an AggregatingAccumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLMetrics (Spark SQL) is used to create a performance metric
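
From user code, registration typically happens through SparkContext.register, which ends up here. A short usage sketch (DistinctStringAccumulator is the hypothetical accumulator sketched under Implementations above):

val acc = new DistinctStringAccumulator
sc.register(acc, "distinct-words") // assigns an ID and registers the accumulator under a name

sc.parallelize(Seq("a", "b", "a")).foreach(word => acc.add(word))
assert(acc.value == Set("a", "b"))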
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#serializing-accumulatorv2","title":"Serializing AccumulatorV2
writeReplace(): Any

                                                                                                                                                                                                                                                                                                                                                                                                                                                            writeReplace is part of the Serializable (Java) abstraction (to designate an alternative object to be used when writing an object to the stream).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            writeReplace...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/AccumulatorV2/#deserializing-accumulatorv2","title":"Deserializing AccumulatorV2
readObject(
  in: ObjectInputStream): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                            readObject is part of the Serializable (Java) abstraction (for special handling during deserialization).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            readObject reads the non-static and non-transient fields of the AccumulatorV2 from the given ObjectInputStream.

If the atDriverSide internal flag is on, readObject turns it off (to indicate that readObject is being executed on an executor). Otherwise, readObject turns the atDriverSide flag on.

readObject then requests the active TaskContext (if available) to register this accumulator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"accumulators/InternalAccumulator/","title":"InternalAccumulator","text":"

InternalAccumulator is a utility with the names of internal accumulators.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"accumulators/InternalAccumulator/#internalmetrics-prefix","title":"internal.metrics Prefix

internal.metrics. is the name prefix of accumulators that are considered internal and should not be displayed in the web UI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            internal.metrics. is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AccumulatorV2 is requested to convert itself to AccumulableInfo and writeReplace
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JsonProtocol is requested to accumValueToJson and accumValueFromJson
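
For reference, the task-level metrics that Spark itself registers follow this naming scheme; a few examples (the exact set may vary across Spark versions):

// All internal task metrics share the internal.metrics. prefix
val exampleInternalNames = Seq(
  "internal.metrics.executorRunTime",
  "internal.metrics.executorCpuTime",
  "internal.metrics.resultSize",
  "internal.metrics.jvmGCTime")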
                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"barrier-execution-mode/","title":"Barrier Execution Mode","text":"

Barrier Execution Mode (Barrier Scheduling) introduces a strong requirement on the Spark Scheduler to launch all tasks of a Barrier Stage at the same time or not at all (and, consequently, to wait until the required resources are available). Moreover, a failure of a single task of a barrier stage fails the whole stage (and hence all the other tasks).

Barrier Execution Mode allows as many tasks to be executed concurrently as the ResourceProfile permits (which is enforced when a barrier job is scheduled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Barrier Execution Mode aims at making Distributed Deep Learning with Apache Spark easier (or even possible).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Rephrasing dmlc/xgboost, Barrier Execution Mode makes sure that:

1. All tasks of a barrier stage are launched at once. If there are not enough task slots, an exception is thrown

2. Tasks either all succeed or all fail. Upon a task failure, Spark aborts all the other tasks (TaskScheduler kills all other running tasks) and restarts the whole barrier stage

3. Spark makes no assumption that tasks don't talk to each other. Quite the opposite, in fact: Spark provides BarrierTaskContext, which facilitates task discovery and coordination (e.g., barrier, allGather)

4. Training can be restarted from a known state (a checkpoint) in case of a failure
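
A minimal barrier-stage sketch using the public RDD.barrier and BarrierTaskContext APIs (the computation itself is only an illustration):

import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 8, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // All 4 tasks are launched together and each waits here until every task has arrived
  ctx.barrier()
  iter.map(_ * 2)
}.collect()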

                                                                                                                                                                                                                                                                                                                                                                                                                                                            From the Design doc: Barrier Execution Mode:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            In Spark, a task in a stage doesn't depend on any other task in the same stage, and hence it can be scheduled independently.

That gives Spark the freedom to schedule tasks in as many task batches as needed. So, 5 tasks can easily be scheduled on 1 CPU core in 5 consecutive batches. That is unlike MPI (and other non-MapReduce scheduling systems), where all workers start together and inter-task communication is expected.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Later in Design doc: Barrier Execution Mode:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            In MPI, all workers start at the same time and pass messages around.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named \"barrier scheduling\", which launches the tasks at the same time and provides users enough information and tooling to embed distributed DL training into a Spark pipeline.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#barrier-rdd","title":"Barrier RDD","text":"

Barrier RDD is an RDDBarrier.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#barrier-stage","title":"Barrier Stage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Barrier Stage is a Stage with at least one Barrier RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#abstractions","title":"Abstractions","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BarrierTaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDDBarrier
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#barrier","title":"RDD.barrier Operator","text":"

Barrier Execution Mode is based on the RDD.barrier operator that tells the Spark Scheduler to launch all the tasks of the current stage together (and marks the current stage as a barrier stage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            barrier(): RDDBarrier[T]\n

RDD.barrier creates an RDDBarrier that comes with the barrier-aware mapPartitions transformation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            mapPartitions[S](\n  f: Iterator[T] => Iterator[S],\n  preservesPartitioning: Boolean = false): RDD[S]\n

Under the covers, RDDBarrier.mapPartitions creates a MapPartitionsRDD like the regular RDD.mapPartitions transformation, but with the isFromBarrier flag enabled.

• Task has an isBarrier flag that says whether this task belongs to a barrier stage (default: false).
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#isFromBarrier","title":"isFromBarrier Flag","text":"

An RDD is in a barrier stage if at least one of its parent RDDs (or the RDD itself) is mapped from an RDDBarrier.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffledRDD has the isBarrier flag always disabled (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            MapPartitionsRDD is the only RDD that can have the isBarrier flag enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            RDDBarrier.mapPartitions is the only transformation that creates a MapPartitionsRDD with the isFromBarrier flag enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#unsupported-spark-features","title":"Unsupported Spark Features","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            The following Spark features are not supported:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Push-Based Shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Dynamic Allocation of Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#demo","title":"Demo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.BarrierTaskContext logger to see what happens inside.
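
One way to do that (assuming Spark 3.3+ where logging is configured with Log4j 2 in conf/log4j2.properties) is to add the following logger definition:

logger.BarrierTaskContext.name = org.apache.spark.BarrierTaskContext\nlogger.BarrierTaskContext.level = all\n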

                                                                                                                                                                                                                                                                                                                                                                                                                                                            val tasksNum = 3\nval nums = sc.parallelize(seq = 0 until 9, numSlices = tasksNum)\nassert(nums.getNumPartitions == tasksNum)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Print out the available partitions and the number of records within each (using Spark SQL for a human-friendlier output).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Scala
                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.TaskContext\nnums\n  .mapPartitions { it => Iterator.single((TaskContext.get.partitionId, it.size)) }\n  .toDF(\"partitionId\", \"size\")\n  .show\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                            +-----------+----+\n|partitionId|size|\n+-----------+----+\n|          0|   3|\n|          1|   3|\n|          2|   3|\n+-----------+----+\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#distributed-training","title":"Distributed Training","text":"

RDD.barrier creates an RDDBarrier (and hence a Barrier Stage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.rdd.RDDBarrier\nassert(nums.barrier.isInstanceOf[RDDBarrier[_]])\n

Use the RDDBarrier.mapPartitions transformation to access a BarrierTaskContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            val barrierRdd = nums\n  .barrier\n  .mapPartitions { ns =>\n    import org.apache.spark.{BarrierTaskContext, TaskContext}\n    val ctx = TaskContext.get.asInstanceOf[BarrierTaskContext]\n    val tid = ctx.partitionId()\n    val port = 10000 + tid\n    val host = \"localhost\"\n    val message = s\"A message from task $tid, e.g. $host:$port it listens at\"\n    val allTaskMessages = ctx.allGather(message)\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> Got host:port's from the other tasks\")\n      allTaskMessages.foreach(println)\n    }\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> Starting a distributed training at the nodes...\")\n    }\n\n    ctx.barrier() // this is BarrierTaskContext.barrier (not RDD.barrier)\n                  // which can be confusing\n\n    if (tid == 0) { // only Task 0 prints out status\n      println(\">>> All tasks have finished\")\n    }\n\n    // return a model after combining (model) pieces from the nodes\n    ns\n  }\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Run a distributed computation (using RDD.count action).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            barrierRdd.count()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                            There should be INFO and TRACE messages printed out to the console (given ALL logging level for org.apache.spark.BarrierTaskContext logger).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            [Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO  org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) has entered the global sync, current barrier epoch is 0.\n...\n[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] TRACE org.apache.spark.BarrierTaskContext:68 - Current callSite: CallSite($anonfun$runBarrier$2 at Logging.scala:68,org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$2(BarrierTaskContext.scala:61)\n...\n[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO  org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) finished global sync successfully, waited for 1 seconds, current barrier epoch is 1.\n...\n

Open up the web UI and explore the execution plans.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#access-mappartitionsrdd","title":"Access MapPartitionsRDD","text":"

MapPartitionsRDD is a private[spark] class, so accessing the RDD.isBarrier method requires being in the org.apache.spark package.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Paste the following code in spark-shell / Scala REPL using :paste -raw mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            package org.apache.spark\n\nobject IsBarrier {\n  import org.apache.spark.rdd.RDD\n  implicit class BypassPrivateSpark[T](rdd: RDD[T]) {\n    def isBarrier = rdd.isBarrier\n  }\n}\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.IsBarrier._\nassert(barrierRdd.isBarrier)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#examples","title":"Examples","text":"

The following projects are worth reviewing: study their source code and learn from it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#synapseml","title":"SynapseML","text":"

SynapseML's LightGBM on Apache Spark can be configured to use Barrier Execution Mode in the following modules (a configuration sketch follows the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • synapse.ml.lightgbm.LightGBMClassifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • synapse.ml.lightgbm.LightGBMRanker
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • synapse.ml.lightgbm.LightGBMRegressor
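
A sketch of what that configuration could look like (the useBarrierExecutionMode parameter and the package name are taken from the SynapseML documentation and may differ across versions; trainingDF is a placeholder DataFrame):

import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier\nval classifier = new LightGBMClassifier()\n  .setLabelCol(\"label\")\n  .setFeaturesCol(\"features\")\n  .setUseBarrierExecutionMode(true) // train all LightGBM workers in one barrier stage\n// classifier.fit(trainingDF) would then launch the training tasks together\n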
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#xgboost4j","title":"XGBoost4J","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            XGBoost4J is the JVM package of xgboost (an optimized distributed gradient boosting library with machine learning algorithms for regression and classification under the Gradient Boosting framework).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            The heart of distributed training in xgboost4j-spark (that can run distributed xgboost on Apache Spark) is XGBoost.trainDistributed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            There's a familiar line that creates a barrier stage (using RDD.barrier()):

                                                                                                                                                                                                                                                                                                                                                                                                                                                            val boostersAndMetrics = trainingRDD.barrier().mapPartitions {\n  // distributed training using XGBoost happens here\n}\n

The barrier mapPartitions block is followed by RDD.collect() that fetches the XGBoost4J-specific metadata (booster and metrics):

                                                                                                                                                                                                                                                                                                                                                                                                                                                            val (booster, metrics) = boostersAndMetrics.collect()(0)\n

Within the barrier stage (within the mapPartitions block), xgboost4j-spark builds a distributed booster:

1. Checkpointing, when enabled, is performed by Task 0 only
2. All tasks initialize the so-called collective Communicator for synchronization
3. xgboost4j-spark uses XGBoostJNI to talk to XGBoost using JNI
4. Only Task 0 returns a non-empty iterator (and that's why the RDD.collect()(0) gets (booster, metrics)); a simplified sketch of this pattern follows this list
5. All tasks execute SXGBoost.train that eventually leads to XGBoost.trainAndSaveCheckpoint
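
A simplified, hypothetical sketch of point 4 (not the actual xgboost4j-spark sources; trainLocally and collectMetrics are made-up placeholders and trainingRDD is the RDD from the earlier snippet): every task trains on its own partition, but only Task 0 emits a result, which is why RDD.collect()(0) returns the (booster, metrics) pair.

// hypothetical stand-ins for the real XGBoost training calls\ndef trainLocally(rows: Iterator[_]): String = { rows.size; \"local-booster\" }\ndef collectMetrics(): Map[String, Double] = Map(\"train-error\" -> 0.0)\n\nval boostersAndMetrics = trainingRDD.barrier().mapPartitions { rows =>\n  val ctx = org.apache.spark.BarrierTaskContext.get()\n  val booster = trainLocally(rows) // every task trains on its own partition\n  ctx.barrier()                    // wait until all tasks have finished training\n  if (ctx.partitionId() == 0) Iterator.single((booster, collectMetrics()))\n  else Iterator.empty              // the other tasks return nothing\n}\nval (booster, metrics) = boostersAndMetrics.collect()(0)\n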
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/#learn-more","title":"Learn More","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. SPIP: Support Barrier Execution Mode in Apache Spark (esp. Design: Barrier execution mode)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Barrier Execution Mode in Spark 3.0 - Part 1 : Introduction
                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/BarrierCoordinator/","title":"Barrier Coordinator RPC Endpoint","text":"

BarrierCoordinator is a ThreadSafeRpcEndpoint that is registered as the barrierSync RPC Endpoint when TaskSchedulerImpl is requested to maybeInitBarrierCoordinator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            BarrierCoordinator is responsible for handling RequestToSync messages to coordinate Global Syncs of barrier tasks (using allGather and barrier operators).

In other words, the driver (TaskSchedulerImpl, to be precise) sets up a BarrierCoordinator upon startup that BarrierTaskContexts talk to using RequestToSync messages. BarrierCoordinator tracks the number of tasks to wait for until a barrier stage is complete and a response can be sent back to the tasks to continue (they otherwise stay paused for up to 365 days (!)).

                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"barrier-execution-mode/BarrierCoordinator/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                            BarrierCoordinator takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Timeout (seconds)
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RpcEnv

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinator is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskSchedulerImpl is requested to maybeInitBarrierCoordinator
                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinator/#receiveAndReply","title":"Processing RequestToSync Messages (from Barrier Tasks)","text":"RpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply handles RequestToSync messages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Unless already registered, receiveAndReply registers a new ContextBarrierId (for the stageId and the stageAttemptId) in the Barrier States registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Multiple Tasks and One BarrierCoordinator

                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply handles RequestToSync messages, one per task in a barrier stage. Out of all the properties of RequestToSync, numTasks, stageId and stageAttemptId are used.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              The very first RequestToSync is used to register the stageId and stageAttemptId (as ContextBarrierId) with numTasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply finds the ContextBarrierState for the stage and the stage attempt (in the Barrier States registry) to handle the RequestToSync.
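
A simplified sketch of that flow (a paraphrase of the description above, not a verbatim copy of the Spark sources; ContextBarrierId, ContextBarrierState and RequestToSync are private[spark], so this only illustrates the shape of the handler):

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {\n  case request: RequestToSync =>\n    val barrierId = ContextBarrierId(request.stageId, request.stageAttemptId)\n    // the very first RequestToSync of a stage attempt registers it with numTasks\n    states.computeIfAbsent(\n      barrierId, (_: ContextBarrierId) => new ContextBarrierState(barrierId, request.numTasks))\n    // delegate the message to the per-stage-attempt state\n    states.get(barrierId).handleRequest(context, request)\n}\n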

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinator/#states","title":"Barrier States","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                              states: ConcurrentHashMap[ContextBarrierId, ContextBarrierState]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinator creates an empty ConcurrentHashMap (Java) when created.

The states registry is used to keep track of all the active barrier stage attempts and the corresponding internal ContextBarrierStates.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              states is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • onStop to clean up
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • cleanupBarrierStage to remove a specific stage attempt
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • receiveAndReply to handle RequestToSync messages
                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinator/#listener","title":"SparkListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinator creates a SparkListener when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              The SparkListener is used to intercept SparkListenerStageCompleted events.

The SparkListener is registered (using addToStatusQueue) at startup and removed at stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinator/#onStageCompleted","title":"onStageCompleted","text":"SparkListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                              onStageCompleted(\n  stageCompleted: SparkListenerStageCompleted): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                              onStageCompleted is part of the SparkListenerInterface abstraction.

onStageCompleted calls cleanupBarrierStage with the stage and attempt number (based on the given SparkListenerStageCompleted).
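
As an illustration only (not BarrierCoordinator's actual private listener), a minimal SparkListener reacting to SparkListenerStageCompleted with the same two values (stage ID and attempt number) could look as follows, assuming an existing SparkContext sc:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}\n\n// Sketch: react to stage completion with the stage ID and attempt number,\n// the very values BarrierCoordinator needs for cleanupBarrierStage\nclass StageCompletionLogger extends SparkListener {\n  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {\n    val info = stageCompleted.stageInfo\n    println(s\"Stage ${info.stageId} (attempt ${info.attemptNumber}) completed\")\n  }\n}\n\n// sc.addSparkListener(new StageCompletionLogger)\n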

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinator/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.BarrierCoordinator logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                              logger.BarrierCoordinator.name = org.apache.spark.BarrierCoordinator\nlogger.BarrierCoordinator.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinatorMessage/","title":"BarrierCoordinatorMessage RPC Messages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinatorMessage is an abstraction of RPC messages that tasks can send out using BarrierTaskContext operators for BarrierCoordinator to handle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinatorMessage is a Serializable (Java) (so it can be sent from executors to the driver over the wire).

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierCoordinatorMessage/#implementations","title":"Implementations","text":"Sealed Trait

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierCoordinatorMessage is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in the Scala Language Specification.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RequestToSync
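
For illustration only, a hypothetical sealed trait (names made up) shows what the compiler enforces and enables: all implementations live in the defining file, and pattern matches can be checked for exhaustiveness.

// All implementations must be defined in this very file\nsealed trait Command extends Serializable\ncase class Sync(numTasks: Int) extends Command\ncase object Stop extends Command\n\n// The compiler warns if a case were missing here\ndef describe(c: Command): String = c match {\n  case Sync(n) => s\"sync of $n tasks\"\n  case Stop    => \"stop\"\n}\n
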
                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierJobAllocationFailed/","title":"BarrierJobAllocationFailed","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierJobAllocationFailed is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/","title":"BarrierJobSlotsNumberCheckFailed","text":""},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/#barrierjobslotsnumbercheckfailed","title":"BarrierJobSlotsNumberCheckFailed","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierJobSlotsNumberCheckFailed is a BarrierJobAllocationFailed with the following exception message:

                                                                                                                                                                                                                                                                                                                                                                                                                                                              [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently.\nPlease init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierJobSlotsNumberCheckFailed can be thrown when DAGScheduler is requested to handle a JobSubmitted event.
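
As a hypothetical way to observe it (assuming a local[2] master, i.e. only 2 slots), a barrier stage with more partitions than available slots eventually fails the job with the message above:

// Hypothetical demo on local[2]: 4 barrier tasks cannot all run at once on 2 slots\nval rdd = sc.parallelize(1 to 4, numSlices = 4)\nrdd.barrier().mapPartitions(iter => iter).collect()\n// expected to fail (possibly after retries) with the [SPARK-24819] message\n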

                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"barrier-execution-mode/BarrierJobSlotsNumberCheckFailed/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                              BarrierJobSlotsNumberCheckFailed takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Required Concurrent Tasks (based on the number of partitions of a barrier RDD)
                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Maximum Number of Concurrent Tasks (based on a ResourceProfile used)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                BarrierJobSlotsNumberCheckFailed is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkCoreErrors is requested to numPartitionsGreaterThanMaxNumConcurrentTasksError
                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"barrier-execution-mode/BarrierTaskContext/","title":"BarrierTaskContext \u2014 TaskContext for Barrier Tasks","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                BarrierTaskContext is a concrete TaskContext of the tasks in a Barrier Stage in Barrier Execution Mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"barrier-execution-mode/BarrierTaskContext/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                BarrierTaskContext takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskContext

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BarrierTaskContext is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Task is requested to run (with isBarrier flag enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/BarrierTaskContext/#barrierCoordinator","title":"Barrier Coordinator RPC Endpoint","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrierCoordinator: RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BarrierTaskContext creates a RpcEndpointRef to Barrier Coordinator RPC Endpoint when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrierCoordinator is used to handle barrier and allGather operators (through runBarrier).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/BarrierTaskContext/#allGather","title":"allGather","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  allGather(\n  message: String): Array[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  allGather runBarrier with the given message and ALL_GATHER request method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Public API and PySpark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  allGather is part of a public API.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  allGather is used in BasePythonRunner.WriterThread (PySpark) when requested to barrierAndServe.
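
A usage sketch (assuming an existing RDD named rdd): every barrier task contributes one message and allGather blocks until the messages of all tasks in the stage are available.

import org.apache.spark.BarrierTaskContext\n\nval gathered = rdd.barrier().mapPartitions { iter =>\n  val tc = BarrierTaskContext.get()\n  // blocks until every task of the barrier stage has called allGather\n  val messages = tc.allGather(s\"partition-${tc.partitionId()}\")\n  Iterator.single(messages.mkString(\",\"))\n}.collect()\n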

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/BarrierTaskContext/#barrier","title":"barrier","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrier(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrier runBarrier with no message and BARRIER request method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Public API and PySpark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrier is part of a public API.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  barrier is used in BasePythonRunner.WriterThread (PySpark) when requested to barrierAndServe.
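
A similar sketch (again with an assumed rdd) where all tasks wait at a global sync point before continuing:

import org.apache.spark.BarrierTaskContext\n\nrdd.barrier().mapPartitions { iter =>\n  val tc = BarrierTaskContext.get()\n  // e.g. materialize partial results first, then...\n  tc.barrier()  // returns only once every task of the stage has reached this call\n  // ...it is safe to consume what the other tasks produced\n  iter\n}.collect()\n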

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/BarrierTaskContext/#runBarrier","title":"Global Sync","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier(\n  message: String,\n  requestMethod: RequestMethod.Value): Array[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) has entered the global sync, current barrier epoch is [barrierEpoch].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Current callSite: [callSite]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier schedules a TimerTask (Java) to print out the following INFO message to the logs every minute:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) waiting under the global sync since [startTime],\nhas been waiting for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier requests the Barrier Coordinator RPC Endpoint to send a RequestToSync one-off message and waits 365 days (!) for a response (a collection of responses from all the barrier tasks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1 Year to Wait for Response from Barrier Coordinator

runBarrier waits up to 1 year for the response to arrive.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier checks every second if the response \"bundle\" arrived.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier increments the barrierEpoch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) finished global sync successfully,\nwaited for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, runBarrier returns the response \"bundle\" (a collection of responses from all the barrier tasks).

In case of a SparkException, runBarrier prints out the following INFO message to the logs and re-throws the exception up the call chain:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) failed to perform global sync,\nwaited for [duration] seconds,\ncurrent barrier epoch is [barrierEpoch].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  runBarrier is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BarrierTaskContext is requested to barrier, allGather
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/BarrierTaskContext/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.BarrierTaskContext logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  logger.BarrierTaskContext.name = org.apache.spark.BarrierTaskContext\nlogger.BarrierTaskContext.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/ContextBarrierState/","title":"ContextBarrierState","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ContextBarrierState represents the state of global sync of a barrier stage (with the number of tasks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ContextBarrierState is used by BarrierCoordinator to handle RequestToSync messages (and to keep track of active barrier stage attempts).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ContextBarrierState

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ContextBarrierState is a private class of BarrierCoordinator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"barrier-execution-mode/ContextBarrierState/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ContextBarrierState takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ContextBarrierId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Number of Tasks (of a barrier stage)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ContextBarrierState is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BarrierCoordinator is requested to handle a RequestToSync message for a new stage and stage attempt IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#barrierId","title":"Barrier Stage Attempt (ContextBarrierId)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ContextBarrierState is given a ContextBarrierId (of a barrier stage) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The ContextBarrierId uniquely identifies a barrier stage by the stage and stage attempt IDs.
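
Conceptually (a simplified sketch, not the private Spark class itself), the identifier is just a pair of the two IDs:

// Simplified: a barrier stage attempt is uniquely identified by (stageId, stageAttemptId)\ncase class ContextBarrierId(stageId: Int, stageAttemptId: Int) {\n  override def toString: String = s\"Stage $stageId (Attempt $stageAttemptId)\"\n}\n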

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#barrierEpoch","title":"Barrier Epoch","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ContextBarrierState initializes barrierEpoch counter to be 0 when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#requesters","title":"Barrier Tasks","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    requesters: ArrayBuffer[RpcCallContext]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    requesters is a registry of RpcCallContexts of the barrier tasks (of a barrier stage attempt) pending a reply.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    It is only when the number of RpcCallContexts in the requesters reaches the number of tasks expected (while handling RequestToSync requests) that this ContextBarrierState is considered finished successfully.

ContextBarrierState initializes requesters, when created, with an initial capacity of the number of tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    A new RpcCallContext of a barrier task is added in handleRequest only when the epoch of the barrier task matches the current barrierEpoch.
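
A simplified sketch of that bookkeeping (hypothetical code; Requester stands in for Spark's internal RpcCallContext):

import scala.collection.mutable.ArrayBuffer\n\n// Placeholder for Spark's internal RpcCallContext\ntrait Requester { def reply(response: Any): Unit }\n\nval numTasks = 4\nval requesters = new ArrayBuffer[Requester](numTasks)\nval messages = new Array[String](numTasks)\n\n// One global sync round: reply to everyone only when all tasks have checked in\ndef handleRequest(requester: Requester, partitionId: Int, message: String): Unit = {\n  requesters += requester\n  messages(partitionId) = message\n  if (requesters.size == numTasks) {\n    requesters.foreach(_.reply(messages))\n    requesters.clear()  // reset for the next barrier epoch\n  }\n}\n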

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#timerTask","title":"TimerTask","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    timerTask: TimerTask\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ContextBarrierState uses a TimerTask (Java) to ensure that a barrier() call can time out.

ContextBarrierState creates a TimerTask (Java) in initTimerTask while handling the first RequestToSync message of a global sync (i.e. when the requesters registry is empty). The TimerTask is then immediately scheduled to be executed after spark.barrier.sync.timeout.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    spark.barrier.sync.timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Since spark.barrier.sync.timeout defaults to 365d (1 year), the TimerTask will run only after one year.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The TimerTask is stopped in cancelTimerTask.
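As a rough illustration (a minimal sketch using java.util.Timer, not Spark's actual code), the timeout mechanics look as follows:

import java.util.{Timer, TimerTask}
import java.util.concurrent.TimeUnit

// Hypothetical timeout, standing in for spark.barrier.sync.timeout (365 days by default)
val timeoutInSecs: Long = TimeUnit.DAYS.toSeconds(365)

// When the task fires, the coordinator would fail all pending requesters
// (replaced here with a println placeholder)
val timerTask: TimerTask = new TimerTask {
  override def run(): Unit =
    println(s"The coordinator didn't get all barrier sync requests within $timeoutInSecs second(s)")
}

// Schedule the task to run once after the timeout (what initTimerTask and handleRequest do)
val timer = new Timer("barrier-sync-timeout-sketch")
timer.schedule(timerTask, TimeUnit.SECONDS.toMillis(timeoutInSecs))

// Cancelling the task (what cancelTimerTask does) prevents it from ever firing
timerTask.cancel()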

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#initTimerTask","title":"Initializing TimerTask","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    initTimerTask(\n  state: ContextBarrierState): Unit\n

initTimerTask creates a new TimerTask (Java) that, when executed, sends a SparkException with the following message to all the requesters, followed by cleanupBarrierStage for this ContextBarrierId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The coordinator didn't get all barrier sync requests\nfor barrier epoch [barrierEpoch] from [barrierId] within [timeoutInSecs] second(s).\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The TimerTask is made available as timerTask.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    initTimerTask is used when:

• ContextBarrierState is requested to handle a RequestToSync message (for the first global sync message received, when the requesters registry is empty)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#messages","title":"messages","text":"

ContextBarrierState initializes the messages registry for messages from all numTasks barrier tasks (of a barrier stage attempt) when created.

The messages registry is initially empty.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    A new message is registered (added) when handling a RequestToSync request.
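As a sketch (assuming a plain per-partition array, not necessarily Spark's exact data structure), the registry can be pictured like this:

// Hypothetical registry for a barrier stage attempt with 4 tasks
val numTasks = 4
val messages: Array[String] = Array.ofDim[String](numTasks)  // all slots empty initially

// Handling a RequestToSync from the barrier task of partition 2 with a user message
val partitionId = 2
messages(partitionId) = "user message from partition 2"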

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#handleRequest","title":"Handling RequestToSync Message","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest(\n  requester: RpcCallContext,\n  request: RequestToSync): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest makes sure that the RequestMethod (of the given RequestToSync) is consistent across barrier tasks (using requestMethods registry).

handleRequest asserts that the number of tasks (of the given RequestToSync) matches this numTasks and is thus consistent across barrier tasks. Otherwise, handleRequest reports an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Number of tasks of [barrierId] is [numTasks] from Task [taskId], previously it was [numTasks].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest prints out the following INFO message to the logs (with the ContextBarrierId and barrierEpoch):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Current barrier epoch for [barrierId] is [barrierEpoch].\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For the first sync message received (requesters is empty), handleRequest initializes the TimerTask and schedules it for execution after the timeoutInSecs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Starting the timerTask ensures that a sync may eventually time out (after a configured delay).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest registers the given requester in the requesters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest registers the message of the RequestToSync in the messages for the partitionId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Barrier sync epoch [barrierEpoch] from [barrierId] received update from Task taskId,\ncurrent progress: [requesters]/[numTasks].\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#updates-from-all-barrier-tasks-received","title":"Updates from All Barrier Tasks Received","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When the barrier sync received updates from all barrier tasks (i.e., the number of requesters is the numTasks), handleRequest replies back to all the requesters with the messages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Barrier sync epoch [barrierEpoch] from [barrierId] received all updates from tasks,\nfinished successfully.\n

handleRequest increments the barrierEpoch, clears the requesters and the requestMethods registries, and then cancels the TimerTask (cancelTimerTask).

When the epoch of the given RequestToSync is different from this barrierEpoch, handleRequest sends back a failure message (with a SparkException) to the given requester:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The request to sync of [barrierId] with barrier epoch [barrierEpoch] has already finished.\nMaybe task [taskId] is not properly killed.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In case of different RequestMethods (in requestMethods registry), handleRequest sends back a failure message to the requesters (incl. the given requester):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Different barrier sync types found for the sync [barrierId]: [requestMethods].\nPlease use the same barrier sync type within a single sync.\n

handleRequest then clears this ContextBarrierState (clear).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    handleRequest is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BarrierCoordinator is requested to handle a RequestToSync message
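Put together, the control flow of handleRequest can be summarized with the following condensed, hypothetical sketch (names mirror the registries above; replies and RPC plumbing are replaced with printlns, and it is not Spark's actual code):

import scala.collection.mutable.ArrayBuffer

// A simplified stand-in for ContextBarrierState: requesters holds plain task identifiers
// rather than RpcCallContexts.
class BarrierStateSketch(numTasks: Int) {
  private var barrierEpoch = 0
  private val requesters = new ArrayBuffer[String]()
  private val messages = Array.ofDim[String](numTasks)

  def handleRequest(requester: String, epoch: Int, partitionId: Int, message: String): Unit = {
    if (epoch != barrierEpoch) {
      // Stale request: the sync of that epoch has already finished
      println(s"The request to sync with barrier epoch $epoch has already finished")
      return
    }
    if (requesters.isEmpty) {
      // First global sync message received: arm the timeout (initTimerTask + schedule)
      println("first sync message received, arming the timeout")
    }
    requesters += requester
    messages(partitionId) = message
    if (requesters.size == numTasks) {
      // Updates from all barrier tasks received: reply with the messages to every requester,
      // move on to the next barrier epoch, clear the registries, cancel the TimerTask
      println(s"barrier epoch $barrierEpoch finished with messages: ${messages.mkString(", ")}")
      barrierEpoch += 1
      requesters.clear()
    }
  }
}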
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/ContextBarrierState/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ContextBarrierState is a private class of BarrierCoordinator and logging is configured using the logger of BarrierCoordinator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/RDDBarrier/","title":"RDDBarrier","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    RDDBarrier is a wrapper around RDD with two custom map transformations:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • mapPartitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • mapPartitionsWithIndex

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Unlike regular RDD.mapPartitions transformations, RDDBarrier transformations create a MapPartitionsRDD with isFromBarrier flag enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    RDDBarrier (of T records) marks the current stage as a barrier stage in Barrier Execution Mode.
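For example (assuming a spark-shell session with an active sc), marking a stage as a barrier stage is a matter of inserting RDD.barrier before mapPartitions:

// A 4-partition dataset
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// RDD.barrier gives an RDDBarrier whose mapPartitions creates a MapPartitionsRDD
// with the isFromBarrier flag enabled, marking the whole stage as a barrier stage
val barrierRDD = rdd
  .barrier()
  .mapPartitions { it =>
    // All 4 tasks of this stage are launched together (or not at all)
    it.map(_ * 2)
  }

barrierRDD.count()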

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"barrier-execution-mode/RDDBarrier/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    RDDBarrier takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RDD (of T records)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RDDBarrier is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RDD.barrier transformation is used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"barrier-execution-mode/RequestMethod/","title":"RequestMethod","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RequestMethod represents the allowed request methods of RequestToSyncs (that are sent out from barrier tasks using BarrierTaskContext).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ContextBarrierState tracks RequestMethods (from tasks inside a barrier sync) to make sure that the tasks are all part of a legitimate barrier sync. All tasks should make sure that they're calling the same method within the same barrier sync phase.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"barrier-execution-mode/RequestMethod/#BARRIER","title":"BARRIER","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Marks execution of BarrierTaskContext.barrier

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"barrier-execution-mode/RequestMethod/#ALL_GATHER","title":"ALL_GATHER","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Marks execution of BarrierTaskContext.allGather
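For instance (a sketch assuming a spark-shell sc and a cluster with enough slots to run all barrier tasks at once), the two request methods correspond to the following calls inside barrier tasks:

import org.apache.spark.BarrierTaskContext

val rdd = sc.parallelize(1 to 8, numSlices = 2)

rdd.barrier().mapPartitions { it =>
  val tc = BarrierTaskContext.get()

  // BARRIER: a plain global sync with no message exchanged
  tc.barrier()

  // ALL_GATHER: a global sync that returns the messages of all barrier tasks
  val all: Array[String] = tc.allGather(s"hello from partition ${tc.partitionId()}")

  it
}.collect()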

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"barrier-execution-mode/RequestToSync/","title":"RequestToSync RPC Message","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RequestToSync is a BarrierCoordinatorMessage to start Global Sync phase.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RequestToSync is sent out from BarrierTaskContext (i.e., barrier tasks on executors) to a BarrierCoordinator (on the driver) to handle.

Operation | Message | RequestMethod
allGather | User-defined message | ALL_GATHER
barrier | empty | BARRIER
"},{"location":"barrier-execution-mode/RequestToSync/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RequestToSync takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Number of tasks (partitions)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Stage Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Task Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • BarrierEpoch
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Partition ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RequestMethod

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RequestToSync is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BarrierTaskContext is requested for a Global Sync
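As an illustration only (field names are assumptions based on the list above, not necessarily Spark's exact declaration), the message shape can be sketched as a Scala case class:

// Hypothetical sketch of the RequestToSync message shape
case class RequestToSyncSketch(
  numTasks: Int,         // number of tasks (partitions) of the barrier stage
  stageId: Int,
  stageAttemptId: Int,
  taskAttemptId: Long,
  barrierEpoch: Int,
  partitionId: Int,
  message: String,       // user-defined message (empty for barrier, per the table above)
  requestMethod: String) // BARRIER or ALL_GATHER (a RequestMethod in Spark)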
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"broadcast-variables/","title":"Broadcast Variables","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        From the official documentation about Broadcast Variables:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        And later in the document:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark uses SparkContext to create broadcast variables and BroadcastManager with ContextCleaner to manage their lifecycle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Not only can Spark developers use broadcast variables for efficient data distribution, but Spark itself uses them quite often too. A very notable use case is when Spark distributes tasks (to executors) for execution.

The idea is to transfer values used in transformations from the driver to executors in the most efficient way, so they are copied once and used many times by tasks (rather than being copied every time a task is launched).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"broadcast-variables/#lifecycle-of-broadcast-variable","title":"Lifecycle of Broadcast Variable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Broadcast variables (TorrentBroadcasts, actually) are created using SparkContext.broadcast method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> val b = sc.broadcast(1)\nb: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable DEBUG logging level for org.apache.spark.storage.BlockManager logger to debug broadcast method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Put block broadcast_0 locally took  430 ms\nPutting block broadcast_0 without replication took  431 ms\nTold master about block broadcast_0_piece0\nPut block broadcast_0_piece0 locally took  4 ms\nPutting block broadcast_0_piece0 without replication took  4 ms\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        A broadcast variable is stored on the driver's BlockManager as a single value and separately as chunks (of spark.broadcast.blockSize).

When requested for the broadcast value, TorrentBroadcast reads the broadcast block from the local BroadcastManager and, if that fails, from the local BlockManager. Only when the local lookups fail does TorrentBroadcast read the broadcast block chunks (from the BlockManagers on the other executors), persist them as a single broadcast variable (in the local BlockManager) and cache it in the BroadcastManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> b.value\nres0: Int = 1\n

Broadcast.value is the only way to access the value of a broadcast variable in a Spark transformation. You can access the broadcast value at any time until the broadcast variable is destroyed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Getting local block broadcast_0\nLevel for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, broadcast variables should be destroyed to release memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        b.destroy\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        With DEBUG logging level enabled, there should be the following messages printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Removing broadcast 0\nRemoving block broadcast_0_piece0\nTold master about block broadcast_0_piece0\nRemoving block broadcast_0\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Broadcast variables can optionally be unpersisted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        b.unpersist\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"broadcast-variables/#introduction","title":"Introduction

You use broadcast variables to implement a map-side join, i.e. a join using a map. For this, lookup tables are distributed across the nodes in a cluster using broadcast and then looked up inside map transformations (to do the join implicitly).

When you broadcast a value, it is copied to executors only once (whereas it would otherwise be copied with every task). This means that broadcasting can make your Spark application faster if you have a large value to use in tasks or there are more tasks than executors.

A Spark idiom has emerged that uses broadcast with collectAsMap to create a broadcast Map: map an RDD down to a smaller dataset (column-wise, not record-wise), collect it as a Map (collectAsMap) and broadcast it; mapping the elements of the very big RDD against the broadcast lookup tables is then computationally faster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        val acMap = sc.broadcast(myRDD.map { case (a,b,c,b) => (a, c) }.collectAsMap)\nval otherMap = sc.broadcast(myOtherRDD.collectAsMap)\n\nmyBigRDD.map { case (a, b, c, d) =>\n  (acMap.value.get(a).get, otherMap.value.get(c).get)\n}.collect\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Use large broadcasted HashMaps over RDDs whenever possible and leave RDDs with a key to lookup necessary data as demonstrated above.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"broadcast-variables/#demo","title":"Demo

You're going to use a static mapping of interesting projects to their websites, i.e. a Map[String, String], that the tasks (i.e. the closures, the anonymous functions in transformations) use.

val pws = Map(
  "Apache Spark" -> "http://spark.apache.org/",
  "Scala" -> "http://www.scala-lang.org/")

val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)

It works, but is quite inefficient: the pws map is sent over the wire to executors with every task, while it could have been shipped once and kept there. If there were more tasks that needed the pws map, you could improve their performance by minimizing the number of bytes sent over the network for task execution.

Enter broadcast variables.

val pwsB = sc.broadcast(pws)
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect
// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)

Semantically, the two computations (with and without the broadcast value) are exactly the same, but the broadcast-based one wins performance-wise when many tasks across many executors use the pws map.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"broadcast-variables/#further-reading-or-watching","title":"Further Reading or Watching
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Map-Side Join in Spark
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"broadcast-variables/Broadcast/","title":"Broadcast","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Broadcast[T] is an abstraction of broadcast variables (with the value of type T).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"broadcast-variables/Broadcast/#contract","title":"Contract","text":""},{"location":"broadcast-variables/Broadcast/#destroying-variable","title":"Destroying Variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        doDestroy(\n  blocking: Boolean): Unit\n

Destroys all the data and metadata related to this broadcast variable.

Used when:

• Broadcast is requested to destroy

Unpersisting Variable
doUnpersist(
  blocking: Boolean): Unit

Deletes the cached copies of this broadcast value on executors.

Used when:

• Broadcast is requested to unpersist

Broadcast Value

getValue(): T

Gets the broadcast value.

Used when:

• Broadcast is requested for the value

Implementations

• TorrentBroadcast
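
To make the contract above more tangible, here is a minimal, self-contained sketch that mirrors its shape. It is a stand-in, not Spark's org.apache.spark.broadcast.Broadcast (TorrentBroadcast is the only real implementation): the three abstract methods are left to implementations, while the public entry points add a validity check.

// Simplified stand-in that mirrors the Broadcast contract described above.
abstract class BroadcastSketch[T](val id: Long) extends Serializable {
  @volatile private var valid = true

  // The contract: concrete implementations supply these three methods.
  protected def getValue(): T
  protected def doUnpersist(blocking: Boolean): Unit
  protected def doDestroy(blocking: Boolean): Unit

  // Public entry points guard against use after destruction.
  def value: T = { assertValid(); getValue() }
  def unpersist(blocking: Boolean = false): Unit = { assertValid(); doUnpersist(blocking) }
  def destroy(blocking: Boolean = false): Unit = {
    assertValid()
    valid = false
    doDestroy(blocking)
  }

  private def assertValid(): Unit =
    require(valid, s"Attempted to use Broadcast($id) after it was destroyed")
}
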
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"broadcast-variables/Broadcast/#creating-instance","title":"Creating Instance","text":"

Broadcast takes the following to be created:

• Unique Identifier

Abstract Class

Broadcast is an abstract class and cannot be created directly. It is created indirectly as one of the concrete Broadcasts (e.g. TorrentBroadcast).

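In user code you never instantiate Broadcast yourself; SparkContext.broadcast gives you a concrete implementation with a unique identifier. A small illustration (the printed class name and id are examples and depend on the session):

val b = sc.broadcast(Seq(1, 2, 3))
// b is a concrete Broadcast[Seq[Int]], e.g. org.apache.spark.broadcast.TorrentBroadcast
println(b.getClass.getName)
// the unique identifier assigned at creation time, e.g. 0 for the first broadcast
println(b.id)
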
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"broadcast-variables/Broadcast/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Broadcast is a Serializable (Java) so it can be serialized (converted to bytes) and send over the wire from the driver to executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/Broadcast/#destroying","title":"Destroying
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          destroy(): Unit // (1)\ndestroy(\n  blocking: Boolean): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1. Non-blocking destroy (blocking is false)

destroy removes persisted data and metadata associated with this broadcast variable.

Note

Once a broadcast variable has been destroyed, it cannot be used again.

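For example (a sketch against an active SparkContext sc; the broadcast value is arbitrary):

val b = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(b.value("a"))   // uses the broadcast value: 1

// Remove the value and metadata everywhere (non-blocking by default)
b.destroy()

// From this point on, b.value, b.unpersist() and b.destroy() all fail
// with a SparkException (see Validation below).
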
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/Broadcast/#unpersisting","title":"Unpersisting
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          unpersist(): Unit // (1)\nunpersist(\n  blocking: Boolean): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1. Non-blocking unpersist (blocking is false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          unpersist...FIXME

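A sketch of the difference from destroy (assuming an active SparkContext sc): after unpersist the broadcast variable is still valid and its value is simply shipped again on next use.

val words = sc.broadcast(Set("spark", "scala"))

val hits = sc.parallelize(Seq("spark", "flink")).filter(words.value).count()
// hits: Long = 1

// Drop the cached copies on the executors (non-blocking by default) ...
words.unpersist()

// ... the variable is still usable; the value is re-sent when needed again
val hitsAgain = sc.parallelize(Seq("scala")).filter(words.value).count()
// hitsAgain: Long = 1
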
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/Broadcast/#brodcast-value","title":"Brodcast Value
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          value: T\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          value makes sure that it was not destroyed and gets the value.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/Broadcast/#text-representation","title":"Text Representation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          toString: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          toString uses the id as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Broadcast([id])\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/Broadcast/#validation","title":"Validation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Broadcast is considered valid until destroyed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Broadcast throws a SparkException (with the text representation) when destroyed but requested for the value, to unpersist or destroy:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Attempted to use [toString] after it was destroyed ([destroySite])\n
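
For example, a destroyed broadcast variable can no longer be read (a sketch; the exact message, including the destroy site, depends on where destroy was called):

import org.apache.spark.SparkException

val b = sc.broadcast("hello")
b.destroy()

try {
  b.value
} catch {
  case e: SparkException =>
    // e.g. Attempted to use Broadcast(0) after it was destroyed (destroy at <console>:27)
    println(e.getMessage)
}
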
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"broadcast-variables/BroadcastFactory/","title":"BroadcastFactory","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          BroadcastFactory is an abstraction of broadcast variable factories that BroadcastManager uses to create or delete (unbroadcast) broadcast variables.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"broadcast-variables/BroadcastFactory/#contract","title":"Contract","text":""},{"location":"broadcast-variables/BroadcastFactory/#initialize","title":"Initializing","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          initialize(\n  isDriver: Boolean,\n  conf: SparkConf): Unit\n
Procedure

initialize is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

See:

• TorrentBroadcastFactory

Used when:

• BroadcastManager is requested to initialize

Creating Broadcast Variable
newBroadcast[T: ClassTag](
  value: T,
  isLocal: Boolean,
  id: Long,
  serializedOnly: Boolean = false): Broadcast[T]

See:

• TorrentBroadcastFactory

Used when:

• BroadcastManager is requested for a new broadcast variable

Stopping

stop(): Unit

Procedure

stop is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

See:

• TorrentBroadcastFactory

Used when:

• BroadcastManager is requested to stop

Deleting Broadcast Variable

unbroadcast(
  id: Long,
  removeFromDriver: Boolean,
  blocking: Boolean): Unit

Procedure

unbroadcast is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

See:

• TorrentBroadcastFactory

Used when:

• BroadcastManager is requested to delete a broadcast variable (unbroadcast)

Implementations

• TorrentBroadcastFactory
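
BroadcastFactory itself is internal to Spark, so the following is only a self-contained stand-in that mirrors the four-step lifecycle described above (initialize, newBroadcast, unbroadcast, stop) with a toy in-memory implementation; the types are hypothetical, not Spark's.

import scala.collection.concurrent.TrieMap
import scala.reflect.ClassTag

// Toy stand-in types (not Spark's BroadcastFactory / Broadcast)
final case class BroadcastStub[T](id: Long, value: T)

trait BroadcastFactorySketch {
  def initialize(isDriver: Boolean): Unit
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): BroadcastStub[T]
  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
  def stop(): Unit
}

// In-memory implementation: registers, looks up and removes broadcast values locally
class InMemoryBroadcastFactory extends BroadcastFactorySketch {
  private val registry = TrieMap.empty[Long, Any]

  def initialize(isDriver: Boolean): Unit = ()

  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): BroadcastStub[T] = {
    registry(id) = value
    BroadcastStub(id, value)
  }

  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit =
    registry.remove(id)

  def stop(): Unit = registry.clear()
}
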
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"broadcast-variables/BroadcastManager/","title":"BroadcastManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          BroadcastManager manages a TorrentBroadcastFactory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          As of Spark 2.0, it is no longer possible to plug a custom BroadcastFactory in, and TorrentBroadcastFactory is the only known implementation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"broadcast-variables/BroadcastManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          BroadcastManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • isDriver flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SecurityManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            While being created, BroadcastManager is requested to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BroadcastManager is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkEnv utility is used to create a base SparkEnv (for the driver and executors)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"broadcast-variables/BroadcastManager/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Unless initialized already, initialize creates a TorrentBroadcastFactory and requests it to initialize itself.

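A minimal sketch of that "initialize at most once" guard, using hypothetical, simplified stand-in types (not Spark's internal classes):

// Simplified stand-in for the factory being managed
class TorrentLikeFactory {
  def initialize(isDriver: Boolean): Unit = println(s"factory initialized (isDriver=$isDriver)")
  def stop(): Unit = println("factory stopped")
}

class BroadcastManagerSketch(isDriver: Boolean) {
  private var initialized = false
  private var factory: TorrentLikeFactory = _

  // Guarded so repeated calls create and initialize the factory only once
  def initialize(): Unit = synchronized {
    if (!initialized) {
      factory = new TorrentLikeFactory
      factory.initialize(isDriver)
      initialized = true
    }
  }

  def stop(): Unit = factory.stop()
}

// Usage sketch
val bm = new BroadcastManagerSketch(isDriver = true)
bm.initialize()  // creates and initializes the factory
bm.initialize()  // no-op: already initialized
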
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/BroadcastManager/#torrentbroadcastfactory","title":"TorrentBroadcastFactory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BroadcastManager manages a BroadcastFactory:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Creates and initializes it when created (and requested to initialize)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stops it when stopped

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BroadcastManager uses the BroadcastFactory when requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Creating a new broadcast variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Deleting a broadcast variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/BroadcastManager/#creating-broadcast-variable","title":"Creating Broadcast Variable
newBroadcast(
  value_ : T,
  isLocal: Boolean): Broadcast[T]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            newBroadcast requests the BroadcastFactory for a new broadcast variable (with the next available broadcast ID).

newBroadcast is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested for a new broadcast variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapOutputTracker utility is used to serializeMapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/BroadcastManager/#unique-identifiers-of-broadcast-variables","title":"Unique Identifiers of Broadcast Variables

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BroadcastManager tracks broadcast variables and assigns unique and continuous identifiers.
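A minimal sketch of such an identifier counter (assuming it behaves like an atomic counter; the class below is illustrative, not Spark's actual field):

import java.util.concurrent.atomic.AtomicLong

// Illustrative counter for unique, continuous broadcast identifiers: 0, 1, 2, ...
class BroadcastIdCounterSketch {
  private val nextBroadcastId = new AtomicLong(0)

  def nextId(): Long = nextBroadcastId.getAndIncrement()
}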

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/BroadcastManager/#mapoutputtrackermaster","title":"MapOutputTrackerMaster

BroadcastManager is used to create a MapOutputTrackerMaster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/BroadcastManager/#deleting-broadcast-variable","title":"Deleting Broadcast Variable
unbroadcast(
  id: Long,
  removeFromDriver: Boolean,
  blocking: Boolean): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            unbroadcast requests the BroadcastFactory to delete a broadcast variable (by id).

unbroadcast is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ContextCleaner is requested to clean up a broadcast variable
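The delegation itself is straightforward; a minimal sketch (with a simplified BroadcastFactory stand-in, not Spark's actual contract):

// Simplified stand-in for the BroadcastFactory contract (illustration only)
trait BroadcastFactory {
  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
}

class BroadcastManagerSketch(broadcastFactory: BroadcastFactory) {
  // unbroadcast simply forwards the request to the managed BroadcastFactory
  def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit =
    broadcastFactory.unbroadcast(id, removeFromDriver, blocking)
}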
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"broadcast-variables/TorrentBroadcast/","title":"TorrentBroadcast","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TorrentBroadcast is a Broadcast that uses a BitTorrent-like protocol for broadcast blocks distribution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"broadcast-variables/TorrentBroadcast/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TorrentBroadcast takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Broadcast Value (of type T)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Identifier

TorrentBroadcast is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TorrentBroadcastFactory is requested for a new broadcast variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcast/#broadcastblockid","title":"BroadcastBlockId

TorrentBroadcast creates a BroadcastBlockId (with the id) when created.
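The naming scheme can be illustrated with the following simplified stand-in for org.apache.spark.storage.BroadcastBlockId (a sketch, not the real case class):

// Simplified stand-in for org.apache.spark.storage.BroadcastBlockId (illustration only)
case class BroadcastBlockId(broadcastId: Long, field: String = "") {
  def name: String = "broadcast_" + broadcastId + (if (field == "") "" else "_" + field)
}

object BroadcastBlockIdDemo {
  def main(args: Array[String]): Unit = {
    println(BroadcastBlockId(0).name)           // broadcast_0 (the broadcast value itself)
    println(BroadcastBlockId(0, "piece0").name) // broadcast_0_piece0 (one of its block chunks)
  }
}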

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#number-of-block-chunks","title":"Number of Block Chunks

TorrentBroadcast uses numBlocks for the number of block chunks the broadcast variable was blockified into (when the TorrentBroadcast was created).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#transient-lazy-broadcast-value","title":"Transient Lazy Broadcast Value
_value: T

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcast uses _value transient registry for the value that is computed on demand (and cached afterwards).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              _value is a @transient private lazy val and uses the following Scala language features:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. It is not serialized when the TorrentBroadcast is serialized to be sent over the wire to executors (and has to be re-computed afterwards)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2. It is lazily instantiated when first requested and cached afterwards
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#value","title":"Value
getValue(): T

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getValue uses the _value transient registry for the value if available (non-null).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Otherwise, getValue reads the broadcast block (from the local BroadcastManager, BlockManager or falls back to readBlocks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getValue saves the object in the _value registry.

getValue is part of the Broadcast abstraction.
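The caching behaviour boils down to a @transient private lazy val; a minimal sketch (the class and its readBroadcastBlock parameter are illustrative, not the real TorrentBroadcast):

// Illustration of the @transient lazy val caching pattern (not the real TorrentBroadcast)
class LazyBroadcastValueSketch[T](readBroadcastBlock: () => T) extends Serializable {
  // @transient: the cached value is not serialized with the instance sent to executors
  // lazy val: it is computed on first access (by reading the broadcast block) and cached afterwards
  @transient private lazy val _value: T = readBroadcastBlock()

  def getValue: T = _value
}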

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#reading-broadcast-block","title":"Reading Broadcast Block
readBroadcastBlock(): T

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBroadcastBlock looks up the BroadcastBlockId in (the cache of) BroadcastManager and returns the value if found.

Otherwise, readBroadcastBlock sets the SparkConf (setConf) and requests the BlockManager for the locally-stored broadcast data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              If the broadcast block is found locally, readBroadcastBlock requests the BroadcastManager to cache it and returns the value.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              If not found locally, readBroadcastBlock multiplies the numBlocks by the blockSize for an estimated size of the broadcast block. readBroadcastBlock prints out the following INFO message to the logs:

Started reading broadcast variable [id] with [numBlocks] pieces
(estimated total size [estimatedTotalSize])

readBroadcastBlock reads the block chunks (readBlocks) and prints out the following INFO message to the logs:

Reading broadcast variable [id] took [time] ms

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBroadcastBlock unblockifies the block chunks into an object (using the Serializer and the CompressionCodec).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBroadcastBlock requests the BlockManager to store the merged copy (so other tasks on this executor don't need to re-fetch it). readBroadcastBlock uses MEMORY_AND_DISK storage level and the tellMaster flag off.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBroadcastBlock requests the BroadcastManager to cache it and returns the value.
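The overall lookup order (cache, then local BlockManager, then remote chunks) can be sketched as follows; all of the types below are simplified stand-ins, not Spark's API:

// Simplified stand-ins (illustration only)
trait BroadcastCacheSketch {
  def get[T](blockId: String): Option[T]
  def put[T](blockId: String, value: T): Unit
}
trait LocalBlockStoreSketch {
  def getLocalValue[T](blockId: String): Option[T]
  def putLocal[T](blockId: String, value: T): Unit // MEMORY_AND_DISK, tellMaster off in the real code
}

def readBroadcastBlockSketch[T](
    blockId: String,
    cache: BroadcastCacheSketch,
    blockStore: LocalBlockStoreSketch,
    readAndUnblockify: () => T): T = {
  // 1. Try the BroadcastManager cache
  cache.get[T](blockId).getOrElse {
    blockStore.getLocalValue[T](blockId) match {
      // 2. Try the locally-stored broadcast value in the BlockManager
      case Some(value) =>
        cache.put(blockId, value)
        value
      // 3. Fetch the chunks, unblockify them, keep a merged local copy, and cache the value
      case None =>
        val value = readAndUnblockify()
        blockStore.putLocal(blockId, value)
        cache.put(blockId, value)
        value
    }
  }
}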

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#unblockifying-broadcast-value","title":"Unblockifying Broadcast Value
unBlockifyObject(
  blocks: Array[InputStream],
  serializer: Serializer,
  compressionCodec: Option[CompressionCodec]): T

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unBlockifyObject...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#reading-broadcast-block-chunks","title":"Reading Broadcast Block Chunks
readBlocks(): Array[BlockData]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks creates a collection of BlockDatas for numBlocks block chunks.

For every block chunk (iterated over in a random order of chunk IDs between 0 and numBlocks), readBlocks creates a BroadcastBlockId with the id (of the broadcast variable) and the chunk identifier (the piece prefix followed by the chunk ID).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks prints out the following DEBUG message to the logs:

Reading piece [pieceId] of [broadcastId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks first tries to look up the piece locally by requesting the BlockManager to getLocalBytes and, if found, stores the reference in the local block array (for the piece ID).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              If not found in the local BlockManager, readBlocks requests the BlockManager to getRemoteBytes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              With checksumEnabled, readBlocks...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks requests the BlockManager to store the chunk (so other tasks on this executor don't need to re-fetch it) using MEMORY_AND_DISK_SER storage level and reporting to the driver (so other executors can pull these chunks from this executor as well).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks creates a ByteBufferBlockData for the chunk (and stores it in the blocks array).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readBlocks throws a SparkException for blocks neither available locally nor remotely:

Failed to get [pieceId] of [broadcastId]
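Putting the steps together, the chunk-fetching loop can be sketched as follows (all types are simplified stand-ins, not Spark's API):

import scala.util.Random

// Simplified stand-in for the chunk storage (illustration only)
trait ChunkStoreSketch {
  def getLocalChunk(blockId: String): Option[Array[Byte]]
  def getRemoteChunk(blockId: String): Option[Array[Byte]]
  def putLocalChunk(blockId: String, bytes: Array[Byte]): Unit // MEMORY_AND_DISK_SER, reported to the driver
}

def readBlocksSketch(broadcastId: Long, numBlocks: Int, store: ChunkStoreSketch): Array[Array[Byte]] = {
  val blocks = new Array[Array[Byte]](numBlocks)
  // Fetch chunks in a random order so executors pull different pieces from different hosts
  for (pieceId <- Random.shuffle((0 until numBlocks).toList)) {
    val blockId = s"broadcast_${broadcastId}_piece$pieceId"
    val chunk = store.getLocalChunk(blockId)
      .orElse {
        val remote = store.getRemoteChunk(blockId)
        // Keep a fetched chunk locally and report it to the driver,
        // so other executors can pull it from this executor as well
        remote.foreach(bytes => store.putLocalChunk(blockId, bytes))
        remote
      }
      .getOrElse(throw new Exception(s"Failed to get $blockId of broadcast_$broadcastId"))
    blocks(pieceId) = chunk
  }
  blocks
}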
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#compressioncodec","title":"CompressionCodec
compressionCodec: Option[CompressionCodec]

TorrentBroadcast uses the spark.broadcast.compress configuration property to decide whether to use a CompressionCodec for writeBlocks and readBroadcastBlock.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#broadcast-block-chunk-size","title":"Broadcast Block Chunk Size

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcast uses the spark.broadcast.blockSize configuration property for the size of the chunks (pieces) of a broadcast block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcast uses the size for writeBlocks and readBroadcastBlock.
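For example, both properties can be set through SparkConf before the SparkContext is created (the values shown are the documented defaults):

import org.apache.spark.SparkConf

// Example: tuning the broadcast-related properties described above
val conf = new SparkConf()
  .setAppName("broadcast-tuning")
  .set("spark.broadcast.compress", "true") // whether to compress broadcast blocks (default: true)
  .set("spark.broadcast.blockSize", "4m")  // size of every block chunk (default: 4m)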

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#persisting-broadcast-to-blockmanager","title":"Persisting Broadcast (to BlockManager)
writeBlocks(
  value: T): Int

writeBlocks returns the number of blocks (chunks) this broadcast variable was blockified into.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The whole broadcast value is stored in the local BlockManager with MEMORY_AND_DISK storage level while the block chunks with MEMORY_AND_DISK_SER storage level.

writeBlocks is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TorrentBroadcast is created (that happens on the driver only)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              writeBlocks requests the BlockManager to store the given broadcast value (to be identified as the broadcastId and with the MEMORY_AND_DISK storage level).

writeBlocks blockifies the object (into chunks, using the block size, the Serializer, and the optional compressionCodec).

With checksumEnabled, writeBlocks...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              For every block, writeBlocks creates a BroadcastBlockId for the id and piece[index] identifier, and requests the BlockManager to store the chunk bytes (with MEMORY_AND_DISK_SER storage level and reporting to the driver).
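The write path can be summarized with the following sketch; the DriverBlockStoreSketch trait and writeBlocksSketch function are simplified stand-ins, not Spark's API:

import java.nio.ByteBuffer

// Simplified stand-in for the driver-side block storage (illustration only)
trait DriverBlockStoreSketch {
  def putWholeValue[T](blockId: String, value: T): Boolean  // MEMORY_AND_DISK, not reported to the driver
  def putChunk(blockId: String, bytes: ByteBuffer): Boolean // MEMORY_AND_DISK_SER, reported to the driver
}

def writeBlocksSketch[T](
    broadcastId: Long,
    value: T,
    blockify: T => Array[ByteBuffer],
    store: DriverBlockStoreSketch): Int = {
  // 1. Keep a whole copy of the value on the driver so local tasks need not fetch chunks
  if (!store.putWholeValue(s"broadcast_$broadcastId", value)) {
    throw new Exception(s"Failed to store broadcast_$broadcastId in BlockManager")
  }
  // 2. Blockify the value and store every chunk, reporting each one to the driver
  val chunks = blockify(value)
  chunks.zipWithIndex.foreach { case (chunk, index) =>
    val pieceId = s"broadcast_${broadcastId}_piece$index"
    if (!store.putChunk(pieceId, chunk)) {
      throw new Exception(s"Failed to store $pieceId of broadcast_$broadcastId in local BlockManager")
    }
  }
  // 3. Return the number of chunks the value was blockified into
  chunks.length
}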

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#blockifying-broadcast-variable","title":"Blockifying Broadcast Variable
blockifyObject(
  obj: T,
  blockSize: Int,
  serializer: Serializer,
  compressionCodec: Option[CompressionCodec]): Array[ByteBuffer]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              blockifyObject divides (blockifies) the input obj broadcast value into blocks (ByteBuffer chunks). blockifyObject uses the given Serializer to write the value in a serialized format to a ChunkedByteBufferOutputStream of the given blockSize size with the optional CompressionCodec.
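A much-simplified version of the idea (using plain Java serialization instead of Spark's Serializer, ChunkedByteBufferOutputStream and CompressionCodec) could look like this:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// Serialize a value and slice the resulting bytes into chunks of at most blockSize bytes
// (illustration only; the real code streams through a ChunkedByteBufferOutputStream)
def blockifySketch[T](obj: T, blockSize: Int): Array[ByteBuffer] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes) // stand-in for Spark's Serializer
  out.writeObject(obj)
  out.close()
  bytes.toByteArray
    .grouped(blockSize)
    .map(chunk => ByteBuffer.wrap(chunk))
    .toArray
}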

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#error-handling","title":"Error Handling

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In case of any error, writeBlocks prints out the following ERROR message to the logs and requests the local BlockManager to remove the broadcast.

Store broadcast [broadcastId] fail, remove all pieces of the broadcast

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In case of an error while storing the value itself, writeBlocks throws a SparkException:

Failed to store [broadcastId] in BlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In case of an error while storing the chunks of the blockified value, writeBlocks throws a SparkException:

Failed to store [pieceId] of [broadcastId] in local BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#destroying-variable","title":"Destroying Variable
doDestroy(
  blocking: Boolean): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doDestroy removes the persisted state (associated with the broadcast variable) on all the nodes in a Spark application (the driver and executors).

doDestroy is part of the Broadcast abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#unpersisting-variable","title":"Unpersisting Variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doUnpersist(\n  blocking: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doUnpersist removes the persisted state (associated with the broadcast variable) on executors only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doUnpersist\u00a0is part of the Broadcast abstraction.
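For reference, doDestroy and doUnpersist are reached through the public Broadcast.destroy and Broadcast.unpersist methods. A minimal local-mode sketch (the object name and sample data are mine) contrasting the two:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLifecycleDemo extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("broadcast-lifecycle"))

  val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
  println(sc.parallelize(Seq(1, 2, 1)).map(i => lookup.value.getOrElse(i, "?")).collect().mkString(", "))

  // unpersist: drops the copies cached on executors only; the variable stays usable
  // and is re-broadcast on next use (doUnpersist above).
  lookup.unpersist(blocking = true)

  // destroy: removes all state on the driver and executors (doDestroy above);
  // any later use of lookup.value throws a SparkException.
  lookup.destroy()

  sc.stop()
}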

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#removing-persisted-state-broadcast-blocks-of-broadcast-variable","title":"Removing Persisted State (Broadcast Blocks) of Broadcast Variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unpersist(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unpersist prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Unpersisting TorrentBroadcast [id]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, unpersist requests the BlockManagerMaster to remove the blocks of the given broadcast.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unpersist is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TorrentBroadcast is requested to unpersist and destroy
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TorrentBroadcastFactory is requested to unbroadcast
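A simplified, self-contained sketch of the delegation just described; BlockManagerMasterApi and the class names below are illustrative stand-ins for Spark's BlockManagerMaster, not the actual TorrentBroadcast source:

// A toy stand-in for Spark's BlockManagerMaster (illustrative only).
trait BlockManagerMasterApi {
  def removeBroadcast(broadcastId: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
}

// The control flow of the unpersist described above: log, then delegate.
class TorrentBroadcastCleanup(master: BlockManagerMasterApi) {
  def unpersist(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit = {
    println(s"Unpersisting TorrentBroadcast $id") // the DEBUG message above
    master.removeBroadcast(id, removeFromDriver, blocking)
  }
}

object TorrentBroadcastCleanupDemo extends App {
  val loggingMaster = new BlockManagerMasterApi {
    def removeBroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit =
      println(s"removeBroadcast(id=$id, removeFromDriver=$removeFromDriver, blocking=$blocking)")
  }
  new TorrentBroadcastCleanup(loggingMaster).unpersist(0L, removeFromDriver = false, blocking = true)
}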
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#setconf","title":"setConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              setConf(\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              setConf uses the given SparkConf to initialize the compressionCodec, the blockSize and the checksumEnabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              setConf is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TorrentBroadcast is created and re-created (when deserialized on executors)
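A rough sketch of what such an initializer reads from a SparkConf; spark.broadcast.compress, spark.broadcast.blockSize and spark.broadcast.checksum are the documented settings behind these fields, while the class, field names and defaults below are simplifications of mine:

import org.apache.spark.SparkConf

// Simplified holder for the three settings named above.
class BroadcastSettings {
  var compress: Boolean = true
  var blockSizeBytes: Long = 4 * 1024 * 1024
  var checksumEnabled: Boolean = true

  def setConf(conf: SparkConf): Unit = {
    compress = conf.getBoolean("spark.broadcast.compress", true)
    blockSizeBytes = conf.getSizeAsKb("spark.broadcast.blockSize", "4m") * 1024
    checksumEnabled = conf.getBoolean("spark.broadcast.checksum", true)
  }
}

object BroadcastSettingsDemo extends App {
  val settings = new BroadcastSettings
  settings.setConf(new SparkConf().set("spark.broadcast.blockSize", "2m"))
  println(settings.blockSizeBytes) // 2097152
}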
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcast/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.broadcast.TorrentBroadcast logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              log4j.logger.org.apache.spark.broadcast.TorrentBroadcast=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"broadcast-variables/TorrentBroadcastFactory/","title":"TorrentBroadcastFactory","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcastFactory is a BroadcastFactory of TorrentBroadcasts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              As of Spark 2.0 TorrentBroadcastFactory is the only known BroadcastFactory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcastFactory/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcastFactory takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TorrentBroadcastFactory is created for BroadcastManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcastFactory/#newBroadcast","title":"Creating Broadcast Variable","text":"BroadcastFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              newBroadcast[T: ClassTag](\n  value_ : T,\n  isLocal: Boolean,\n  id: Long,\n  serializedOnly: Boolean = false): Broadcast[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              newBroadcast\u00a0is part of the BroadcastFactory abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              newBroadcast creates a new TorrentBroadcast with the given value_ and id (and ignoring isLocal).
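As a self-contained illustration of this creation path, the sketch below models a BroadcastManager-style component that hands out unique ids and delegates to a factory; all Sketch* names are hypothetical stand-ins for Spark's BroadcastFactory and Broadcast abstractions:

import java.util.concurrent.atomic.AtomicLong
import scala.reflect.ClassTag

final case class SketchBroadcast[T](id: Long, value: T)

trait SketchBroadcastFactory {
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): SketchBroadcast[T]
}

class SketchTorrentBroadcastFactory extends SketchBroadcastFactory {
  // isLocal is ignored, just like in TorrentBroadcastFactory.newBroadcast
  override def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): SketchBroadcast[T] =
    SketchBroadcast(id, value)
}

class SketchBroadcastManager(factory: SketchBroadcastFactory) {
  private val nextBroadcastId = new AtomicLong(0)
  def newBroadcast[T: ClassTag](value: T, isLocal: Boolean): SketchBroadcast[T] =
    factory.newBroadcast(value, isLocal, nextBroadcastId.getAndIncrement())
}

object SketchBroadcastDemo extends App {
  val manager = new SketchBroadcastManager(new SketchTorrentBroadcastFactory)
  println(manager.newBroadcast(Seq(1, 2, 3), isLocal = true)) // SketchBroadcast(0,List(1, 2, 3))
}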

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcastFactory/#unbroadcast","title":"Deleting Broadcast Variable","text":"BroadcastFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unbroadcast(\n  id: Long,\n  removeFromDriver: Boolean,\n  blocking: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unbroadcast\u00a0is part of the BroadcastFactory abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              unbroadcast removes all persisted state associated with the broadcast variable (identified by id).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcastFactory/#initialize","title":"Initializing","text":"BroadcastFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize(\n  isDriver: Boolean,\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize\u00a0is part of the BroadcastFactory abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize does nothing (noop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"broadcast-variables/TorrentBroadcastFactory/#stop","title":"Stopping","text":"BroadcastFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop\u00a0is part of the BroadcastFactory abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop does nothing (noop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/BlockFetchStarter/","title":"BlockFetchStarter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockFetchStarter is the <> of...FIXME...to <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [[contract]] [[createAndStart]] [source, java]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void createAndStart(String[] blockIds, BlockFetchingListener listener) throws IOException, InterruptedException;

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              createAndStart is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is 0)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RetryingBlockFetcher is requested to core:RetryingBlockFetcher.md#fetchAllOutstanding[fetchAllOutstanding]
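A self-contained Scala sketch of the createAndStart contract above; FetchStarter, FetchListener and InMemoryFetchStarter are hypothetical stand-ins (the real types are Java interfaces in Spark's network modules), so only the shape of the contract is meant to carry over:

trait FetchListener {
  def onBlockFetchSuccess(blockId: String, data: Array[Byte]): Unit
  def onBlockFetchFailure(blockId: String, exception: Throwable): Unit
}

trait FetchStarter {
  def createAndStart(blockIds: Array[String], listener: FetchListener): Unit
}

// A toy starter that "fetches" blocks from an in-memory map and reports each outcome.
class InMemoryFetchStarter(store: Map[String, Array[Byte]]) extends FetchStarter {
  override def createAndStart(blockIds: Array[String], listener: FetchListener): Unit =
    blockIds.foreach { id =>
      store.get(id) match {
        case Some(bytes) => listener.onBlockFetchSuccess(id, bytes)
        case None        => listener.onBlockFetchFailure(id, new NoSuchElementException(id))
      }
    }
}

object FetchStarterDemo extends App {
  val starter = new InMemoryFetchStarter(Map("rdd_0_0" -> Array[Byte](1, 2, 3)))
  starter.createAndStart(Array("rdd_0_0", "rdd_0_1"), new FetchListener {
    def onBlockFetchSuccess(id: String, data: Array[Byte]): Unit = println(s"fetched $id (${data.length} bytes)")
    def onBlockFetchFailure(id: String, e: Throwable): Unit = println(s"failed $id: $e")
  })
}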

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/BlockFetchingListener/","title":"BlockFetchingListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockFetchingListener\u00a0is an extension of the EventListener (Java) abstraction that want to be notified about block fetch success and failures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockFetchingListener is used to create a OneForOneBlockFetcher, OneForOneBlockPusher and RetryingBlockFetcher.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/BlockFetchingListener/#contract","title":"Contract","text":""},{"location":"core/BlockFetchingListener/#onblockfetchfailure","title":"onBlockFetchFailure
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void onBlockFetchFailure(\n  String blockId,\n  Throwable exception)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/BlockFetchingListener/#onblockfetchsuccess","title":"onBlockFetchSuccess
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void onBlockFetchSuccess(\n  String blockId,\n  ManagedBuffer data)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/BlockFetchingListener/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • \"Unnamed\" in ShuffleBlockFetcherIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • \"Unnamed\" in BlockTransferService
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RetryingBlockFetchListener
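A minimal Scala implementation of this contract that just logs outcomes; it assumes Spark's network modules (where BlockFetchingListener and ManagedBuffer live) are on the classpath:

import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.BlockFetchingListener

class LoggingBlockFetchingListener extends BlockFetchingListener {
  override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit =
    println(s"Fetched $blockId (${data.size()} bytes)")

  override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit =
    println(s"Failed to fetch $blockId: ${exception.getMessage}")
}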
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/CleanerListener/","title":"CleanerListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              = CleanerListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CleanerListener is an abstraction of listeners that can be core:ContextCleaner.md#attachListener[registered with ContextCleaner] to be informed when <>, <>, <>, <> and <> are cleaned.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[rddCleaned]] rddCleaned Callback Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/CleanerListener/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              rddCleaned( rddId: Int): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              rddCleaned is used when...FIXME

broadcastCleaned Callback Method

broadcastCleaned(
  broadcastId: Long): Unit

broadcastCleaned is used when...FIXME

shuffleCleaned Callback Method

shuffleCleaned(
  shuffleId: Int,
  blocking: Boolean): Unit

shuffleCleaned is used when...FIXME

accumCleaned Callback Method

accumCleaned(
  accId: Long): Unit

accumCleaned is used when...FIXME

checkpointCleaned Callback Method

checkpointCleaned(
  rddId: Long): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpointCleaned is used when...FIXME
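The callbacks above can be summarized with a small self-contained sketch; CleanupListener and CountingCleanupListener below are hypothetical stand-ins for the private[spark] CleanerListener, not types from the Spark sources:

// Hypothetical mirror of the five callbacks documented above.
trait CleanupListener {
  def rddCleaned(rddId: Int): Unit
  def broadcastCleaned(broadcastId: Long): Unit
  def shuffleCleaned(shuffleId: Int, blocking: Boolean): Unit
  def accumCleaned(accId: Long): Unit
  def checkpointCleaned(rddId: Long): Unit
}

// A listener that counts how many objects of each kind were cleaned.
class CountingCleanupListener extends CleanupListener {
  private val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  private def bump(kind: String): Unit = counts(kind) += 1

  def rddCleaned(rddId: Int): Unit = bump("rdd")
  def broadcastCleaned(broadcastId: Long): Unit = bump("broadcast")
  def shuffleCleaned(shuffleId: Int, blocking: Boolean): Unit = bump("shuffle")
  def accumCleaned(accId: Long): Unit = bump("accumulator")
  def checkpointCleaned(rddId: Long): Unit = bump("checkpoint")

  def summary: Map[String, Int] = counts.toMap
}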

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/ContextCleaner/","title":"ContextCleaner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ContextCleaner is a Spark service that is responsible for <> (cleanup) of <>, <>, <>, <> and <> that is aimed at reducing the memory requirements of long-running data-heavy Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/ContextCleaner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ContextCleaner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[sc]] SparkContext.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ContextCleaner is created and requested to start when SparkContext is created with configuration-properties.md#spark.cleaner.referenceTracking[spark.cleaner.referenceTracking] configuration property enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[cleaningThread]] Spark Context Cleaner Cleaning Thread

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ContextCleaner uses a daemon thread Spark Context Cleaner to clean RDD, shuffle, and broadcast states.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The Spark Context Cleaner thread is started when ContextCleaner is requested to <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[listeners]][[attachListener]] CleanerListeners

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ContextCleaner allows attaching core:CleanerListener.md[CleanerListeners] to be informed when objects are cleaned using attachListener method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/ContextCleaner/#sourcescala","title":"[source,scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              attachListener( listener: CleanerListener): Unit
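Since ContextCleaner and CleanerListener are private[spark], attaching a listener only works from within Spark's own code base. A hedged sketch of what that looks like (it assumes the file lives in the org.apache.spark package of the Spark source tree):

package org.apache.spark

// Attaching a listener to the cleaner of a SparkContext (if reference tracking
// is enabled); CleanerListener and ContextCleaner are only visible here because
// this file sits in the org.apache.spark package.
object AttachCleanerListenerSketch {
  def attach(sc: SparkContext, listener: CleanerListener): Unit =
    sc.cleaner.foreach(_.attachListener(listener))
}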

doCleanupRDD Method

doCleanupRDD(
  rddId: Int,
  blocking: Boolean): Unit

doCleanupRDD...FIXME

doCleanupRDD is used when ContextCleaner is requested to keep cleaning for a CleanRDD task.
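Although the description above is a FIXME, cleaning an RDD essentially boils down to unpersisting it and notifying the registered listeners. The sketch below is a self-contained stand-in for that flow (UnpersistApi and RddCleanupSketch are names of mine, not Spark types):

import scala.collection.mutable.ArrayBuffer

// Stand-in for the part of SparkContext that can unpersist an RDD by id.
trait UnpersistApi { def unpersistRDD(rddId: Int, blocking: Boolean): Unit }

class RddCleanupSketch(sc: UnpersistApi) {
  // rddCleaned-style callbacks, one per attached listener.
  val listeners = new ArrayBuffer[Int => Unit]()

  def doCleanupRDD(rddId: Int, blocking: Boolean): Unit = {
    sc.unpersistRDD(rddId, blocking)   // drop the cached partitions of the RDD
    listeners.foreach(_.apply(rddId))  // notify CleanerListener-style listeners
  }
}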

== [[keepCleaning]] keepCleaning Internal Method

[source, scala]
----
keepCleaning(): Unit
----

keepCleaning runs indefinitely until ContextCleaner is requested to <>.

keepCleaning...FIXME

keepCleaning prints out the following DEBUG message to the logs:

[source,plaintext]
----
Got cleaning task [task]
----

keepCleaning is used in <> that is started once when ContextCleaner is requested to <>.
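
keepCleaning follows the standard weak-reference pattern: objects registered for cleanup are wrapped in weak references bound to the referenceQueue, and the cleaning thread repeatedly polls that queue and dispatches the associated cleanup task. The following is a minimal standalone sketch of that pattern (not the Spark sources; all names are illustrative):

[source, scala]
----
import java.lang.ref.{ReferenceQueue, WeakReference}
import java.util.concurrent.ConcurrentHashMap

// Illustrative stand-ins for CleanupTask and CleanupTaskWeakReference
sealed trait CleanupTask
case class CleanRDD(rddId: Int) extends CleanupTask

class TaskWeakRef(val task: CleanupTask, referent: AnyRef, q: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](referent, q)

val referenceQueue = new ReferenceQueue[AnyRef]
// strong references to the wrappers so that only the referents can be collected
val referenceBuffer = ConcurrentHashMap.newKeySet[TaskWeakRef]()

def registerForCleanup(obj: AnyRef, task: CleanupTask): Unit =
  referenceBuffer.add(new TaskWeakRef(task, obj, referenceQueue))

// The cleaning loop: block on the queue, log the task, dispatch it
val cleaningThread = new Thread(new Runnable {
  override def run(): Unit = while (true) {
    referenceQueue.remove(100) match {
      case ref: TaskWeakRef =>
        println(s"Got cleaning task ${ref.task}") // the DEBUG message above
        referenceBuffer.remove(ref)
        ref.task match {
          case CleanRDD(id) => println(s"cleaning RDD $id") // doCleanupRDD in Spark
        }
      case _ => // timed out (null); keep polling
    }
  }
}, "demo-context-cleaner")
cleaningThread.setDaemon(true)
cleaningThread.start()

// Register an object, drop the only reference, and suggest a GC
registerForCleanup(new Array[Int](1024), CleanRDD(rddId = 42))
System.gc()
----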

== [[registerRDDCheckpointDataForCleanup]] registerRDDCheckpointDataForCleanup Method

[source, scala]
----
registerRDDCheckpointDataForCleanup[T](
  rdd: RDD[_],
  parentId: Int): Unit
----

registerRDDCheckpointDataForCleanup...FIXME

registerRDDCheckpointDataForCleanup is used when ContextCleaner is requested to <> (with the configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled).
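
Checkpoint cleanup is opt-in (spark.cleaner.referenceTracking.cleanCheckpoints is disabled by default). A minimal way to exercise it, sketched for local mode (the application name and checkpoint directory below are just for illustration):

[source, scala]
----
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("checkpoint-cleanup-demo")
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
val sc = SparkContext.getOrCreate(conf)
sc.setCheckpointDir("/tmp/checkpoint-cleanup-demo")

val rdd = sc.parallelize(1 to 10)
rdd.checkpoint() // with the property enabled, the checkpoint data is registered for cleanup
rdd.count()      // the action materializes the (reliable) checkpoint
----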

== [[registerBroadcastForCleanup]] registerBroadcastForCleanup Method

[source, scala]
----
registerBroadcastForCleanup[T](
  broadcast: Broadcast[T]): Unit
----

registerBroadcastForCleanup...FIXME

registerBroadcastForCleanup is used when SparkContext is requested to SparkContext.md#broadcast[create a broadcast variable].
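
In user code the registration is invisible; creating the broadcast variable is enough. For example (assuming a spark-shell session with sc in scope):

[source, scala]
----
// Creating a broadcast variable registers it with ContextCleaner behind the scenes
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value(k)).collect()
// Once all references to `lookup` are gone (or lookup.destroy() is called),
// the broadcast becomes eligible for cleanup after the next GC
----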

== [[registerRDDForCleanup]] registerRDDForCleanup Method

[source, scala]
----
registerRDDForCleanup(
  rdd: RDD[_]): Unit
----

registerRDDForCleanup...FIXME

registerRDDForCleanup is used for the rdd:RDD.md#persist[RDD.persist] operation.
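
In other words, persisting (caching) an RDD is what puts it under the cleaner's control. For example (spark-shell assumed):

[source, scala]
----
val cached = sc.parallelize(1 to 1000).map(_ * 2).persist() // registers the RDD for cleanup
cached.count()
cached.unpersist() // explicit cleanup; otherwise ContextCleaner removes the blocks
                   // once the RDD is unreachable and a GC has run
----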

== [[registerAccumulatorForCleanup]] registerAccumulatorForCleanup Method

[source, scala]
----
registerAccumulatorForCleanup(
  a: AccumulatorV2[_, _]): Unit
----

registerAccumulatorForCleanup...FIXME

registerAccumulatorForCleanup is used when AccumulatorV2 is requested to register.
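
For example, accumulators created through SparkContext end up registered this way (spark-shell assumed):

[source, scala]
----
val hits = sc.longAccumulator("hits") // AccumulatorV2.register runs under the covers
sc.parallelize(1 to 10).foreach(i => if (i % 2 == 0) hits.add(1))
println(hits.value) // 5
----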

== [[stop]] Stopping ContextCleaner

[source, scala]
----
stop(): Unit
----

stop...FIXME

stop is used when SparkContext is requested to SparkContext.md#stop[stop].

== [[start]] Starting ContextCleaner

[source, scala]
----
start(): Unit
----

start starts the <> and schedules an action that requests the JVM garbage collector (using System.gc()) on a regular basis, per the configuration-properties.md#spark.cleaner.periodicGC.interval[spark.cleaner.periodicGC.interval] configuration property.

The action to request the JVM GC is scheduled on <>.

start is used when SparkContext is created.

== [[periodicGCService]] periodicGCService Single-Thread Executor Service

periodicGCService is an internal single-thread {java-javadoc-url}/java/util/concurrent/ScheduledExecutorService.html[executor service] with the name context-cleaner-periodic-gc to request the JVM garbage collector.

The periodic runs are started when <> and stopped when <>.
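
The mechanics are plain java.util.concurrent scheduling. A standalone sketch of the idea (not the Spark sources; the 30-minute interval merely stands in for spark.cleaner.periodicGC.interval):

[source, scala]
----
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// Single daemon thread named after Spark's periodic-GC service
val factory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "context-cleaner-periodic-gc")
    t.setDaemon(true)
    t
  }
}
val periodicGCService = Executors.newSingleThreadScheduledExecutor(factory)

// Request a JVM GC at a fixed interval
val intervalMinutes = 30L
periodicGCService.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)

// On stop, shut the service down
// periodicGCService.shutdown()
----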

== [[registerShuffleForCleanup]] Registering ShuffleDependency for Cleanup

[source, scala]
----
registerShuffleForCleanup(
  shuffleDependency: ShuffleDependency[_, _, _]): Unit
----

registerShuffleForCleanup registers the given ShuffleDependency for cleanup.

Internally, registerShuffleForCleanup simply executes <> for the given ShuffleDependency.

registerShuffleForCleanup is used when ShuffleDependency is created.

== [[registerForCleanup]] Registering Object Reference For Cleanup

[source, scala]
----
registerForCleanup(
  objectForCleanup: AnyRef,
  task: CleanupTask): Unit
----

registerForCleanup adds the input objectForCleanup to the <> internal queue.

Despite the widest-possible AnyRef type of the input objectForCleanup, what is actually buffered is a CleanupTaskWeakReference, a custom {java-javadoc-url}/java/lang/ref/WeakReference.html[java.lang.ref.WeakReference] that ties the object and the given CleanupTask to the <>.

registerForCleanup is used when ContextCleaner is requested to <>, <>, <>, <>, and <>.

== [[doCleanupShuffle]] Shuffle Cleanup

[source, scala]
----
doCleanupShuffle(
  shuffleId: Int,
  blocking: Boolean): Unit
----

doCleanupShuffle performs a shuffle cleanup, i.e. removes the shuffle from the current scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] and storage:BlockManagerMaster.md[BlockManagerMaster]. doCleanupShuffle also notifies core:CleanerListener.md[CleanerListeners].

Internally, when executed, doCleanupShuffle prints out the following DEBUG message to the logs:

[source,plaintext]
----
Cleaning shuffle [id]
----

doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#mapOutputTracker[MapOutputTracker] to scheduler:MapOutputTracker.md#unregisterShuffle[unregister the given shuffle].

doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#blockManager[BlockManagerMaster] to storage:BlockManagerMaster.md#removeShuffle[remove the shuffle blocks] (for the given shuffleId).

doCleanupShuffle informs all registered <> that the core:CleanerListener.md#shuffleCleaned[shuffle was cleaned].

In the end, doCleanupShuffle prints out the following DEBUG message to the logs:

[source,plaintext]
----
Cleaned shuffle [id]
----

In case of any exception, doCleanupShuffle prints out the following ERROR message to the logs (together with the exception itself):

[source,plaintext]
----
Error cleaning shuffle [id]
----

doCleanupShuffle is used when ContextCleaner is requested to <> and (interestingly) while fitting an ALSModel (in Spark MLlib).
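
To watch the messages above, enable DEBUG (or ALL) logging for org.apache.spark.ContextCleaner (see <<logging, Logging>> below) and let a shuffle go out of scope, e.g. in spark-shell:

[source, scala]
----
// Run a shuffle inside a method so the ShuffleDependency has no live references afterwards
def runShuffle(sc: org.apache.spark.SparkContext): Long = {
  val counts = sc.parallelize(1 to 1000).map(i => (i % 10, 1)).reduceByKey(_ + _)
  counts.count()
}
runShuffle(sc)

// Suggest a GC so the weak reference gets enqueued
// (the periodic GC service would do this on its own eventually)
System.gc()

// With DEBUG logging enabled you should eventually see
//   Cleaning shuffle [id]
//   Cleaned shuffle [id]
----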

== [[logging]] Logging

Enable ALL logging level for org.apache.spark.ContextCleaner logger to see what happens inside.

Add the following line to conf/log4j.properties:

[source,plaintext]
----
log4j.logger.org.apache.spark.ContextCleaner=ALL
----

Refer to spark-logging.md[Logging].

== [[internal-properties]] Internal Properties

=== [[referenceBuffer]] referenceBuffer

=== [[referenceQueue]] referenceQueue

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/InMemoryStore/","title":"InMemoryStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              InMemoryStore is a KVStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/InMemoryStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              InMemoryStore takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              InMemoryStore is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • FsHistoryProvider is created and requested to createInMemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • AppStatusStore utility is used to create an AppStatusStore for a live Spark application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/KVStore/","title":"KVStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              KVStore is an abstraction of key-value stores.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              KVStore is a Java Closeable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/KVStore/#contract","title":"Contract","text":""},{"location":"core/KVStore/#count","title":"count
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              long count(\n  Class<?> type)\nlong count(\n  Class<?> type,\n  String index,\n  Object indexedValue)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#delete","title":"delete
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void delete(\n  Class<?> type,\n  Object naturalKey)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#getmetadata","title":"getMetadata
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              <T> T getMetadata(\n  Class<T> klass)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#read","title":"read
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              <T> T read(\n  Class<T> klass,\n  Object naturalKey)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#removeallbyindexvalues","title":"removeAllByIndexValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              <T> boolean removeAllByIndexValues(\n  Class<T> klass,\n  String index,\n  Collection<?> indexValues)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#setmetadata","title":"setMetadata
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void setMetadata(\n  Object value)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#view","title":"view
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              <T> KVStoreView<T> view(\n  Class<T> type)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              KVStoreView over entities of the given type

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#write","title":"write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void write(\n  Object value)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"core/KVStore/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ElementTrackingStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • InMemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • LevelDB
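
A small usage sketch of the contract against the in-memory implementation; the Entry class is made up for the example (entities expose a natural key through the @KVIndex annotation, similar to how Spark's own status entities do):

```scala
import org.apache.spark.util.kvstore.{InMemoryStore, KVIndex}

// A made-up entity type: KVStore implementations locate entries by the
// natural key, i.e. the value annotated with @KVIndex
class Entry(val id: String, val payload: String) {
  @KVIndex def key: String = id
}

val store = new InMemoryStore()
store.write(new Entry("e1", "hello"))
store.write(new Entry("e2", "world"))
assert(store.count(classOf[Entry]) == 2L)

val e1 = store.read(classOf[Entry], "e1")
println(e1.payload) // hello

store.delete(classOf[Entry], "e2")
store.close() // KVStore is a Closeable
```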
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/LevelDB/","title":"LevelDB","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LevelDB is a KVStore for FsHistoryProvider.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"core/LevelDB/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LevelDB takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Path
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • KVStoreSerializer

LevelDB is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • KVUtils utility is used to open (a LevelDB store)
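For illustration, a LevelDB store can also be created directly. The following is a minimal sketch (not how FsHistoryProvider sets it up), assuming the spark-kvstore classes (org.apache.spark.util.kvstore) and the LevelDB JNI bindings are on the classpath; the directory path is made up:

import java.io.File
import org.apache.spark.util.kvstore.{KVStoreSerializer, LevelDB}

// Open (or create) a LevelDB-backed KVStore under a local directory
val store = new LevelDB(new File("/tmp/kvstore-demo"), new KVStoreSerializer())

// ...read and write objects whose types carry @KVIndex-annotated fields...

store.close()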
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"core/RetryingBlockFetcher/","title":"RetryingBlockFetcher","text":"

RetryingBlockFetcher is a block fetcher that retries fetching blocks (using a BlockFetchStarter) on IO-related fetch failures.

RetryingBlockFetcher is created and immediately started when:

• NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0, which it is by default)

RetryingBlockFetcher uses a BlockFetchStarter to core:BlockFetchStarter.md#createAndStart[createAndStart] when requested to start and later when fetching all outstanding blocks.

[[outstandingBlocksIds]] RetryingBlockFetcher uses the outstandingBlocksIds internal registry of outstanding block IDs to fetch. It is initialized with the block IDs to fetch when RetryingBlockFetcher is created.

When initiating a retry, RetryingBlockFetcher prints out the following INFO message to the logs (with the number of outstanding block IDs):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Retrying fetch ([retryCount]/[maxRetries]) for [size] outstanding blocks after [retryWaitTime] ms\n

On fetch success and failure, the RetryingBlockFetchListener removes the block ID from the outstandingBlocksIds internal registry.

[[currentListener]] RetryingBlockFetcher uses a RetryingBlockFetchListener to remove block IDs from the outstandingBlocksIds internal registry.
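To make the bookkeeping above concrete, here is a simplified, self-contained sketch (in Scala, not Spark's actual Java implementation): block IDs are dropped from the outstanding registry as they are fetched, and whatever remains is retried up to a maximum number of times with a wait in between. The fetchOne function is a made-up stand-in for the real BlockFetchStarter and listener machinery.

import scala.collection.mutable

// Illustration only: retries fetching the IDs that remain outstanding
def fetchWithRetries(
    blockIds: Seq[String],
    maxRetries: Int,
    retryWaitTimeMs: Long)(
    fetchOne: String => Boolean): Set[String] = {
  val outstanding = mutable.LinkedHashSet(blockIds: _*)
  var retryCount = 0
  var done = false
  while (!done) {
    // Attempt all outstanding blocks; drop the ones fetched successfully
    outstanding.toSeq.foreach { id =>
      if (fetchOne(id)) outstanding -= id
    }
    if (outstanding.isEmpty || retryCount >= maxRetries) {
      done = true
    } else {
      retryCount += 1
      println(s"Retrying fetch ($retryCount/$maxRetries) for ${outstanding.size} " +
        s"outstanding blocks after $retryWaitTimeMs ms")
      Thread.sleep(retryWaitTimeMs)
    }
  }
  outstanding.toSet // whatever is left could not be fetched
}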

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[creating-instance]] Creating RetryingBlockFetcher Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RetryingBlockFetcher takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[conf]] network:TransportConf.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[fetchStarter]] core:BlockFetchStarter.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[blockIds]] Block IDs to fetch
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[listener]] core:BlockFetchingListener.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[start]] Starting RetryingBlockFetcher -- start Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"core/RetryingBlockFetcher/#source-java","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#void-start","title":"void start()","text":"

start simply fetches all outstanding blocks (fetchAllOutstanding).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                start is used when:

• NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0, which it is by default)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[initiateRetry]] initiateRetry Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"core/RetryingBlockFetcher/#source-java_1","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#synchronized-void-initiateretry","title":"synchronized void initiateRetry()","text":"

initiateRetry increments the retry count, registers a new RetryingBlockFetchListener as the currentListener, prints out the INFO message above, and schedules fetchAllOutstanding to be executed after the retry wait time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"core/RetryingBlockFetcher/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                initiateRetry is used when:

• RetryingBlockFetcher is requested to fetchAllOutstanding (and a block fetch fails with an IO exception)"},{"location":"core/RetryingBlockFetcher/#retryingblockfetchlistener-is-requested-to","title":"* RetryingBlockFetchListener is requested to onBlockFetchFailure

== [[fetchAllOutstanding]] fetchAllOutstanding Internal Method

","text":""},{"location":"core/RetryingBlockFetcher/#source-java_2","title":"[source, java]","text":""},{"location":"core/RetryingBlockFetcher/#void-fetchalloutstanding","title":"void fetchAllOutstanding()","text":"

fetchAllOutstanding requests the BlockFetchStarter to core:BlockFetchStarter.md#createAndStart[createAndStart] fetching the outstanding block IDs.

NOTE: fetchAllOutstanding is used when RetryingBlockFetcher is requested to start and initiateRetry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[RetryingBlockFetchListener]] RetryingBlockFetchListener

RetryingBlockFetchListener is a core:BlockFetchingListener.md[] that RetryingBlockFetcher uses to remove block IDs from the outstandingBlocksIds internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[RetryingBlockFetchListener-onBlockFetchSuccess]] onBlockFetchSuccess Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"core/RetryingBlockFetcher/#source-scala","title":"[source, scala]","text":""},{"location":"core/RetryingBlockFetcher/#void-onblockfetchsuccessstring-blockid-managedbuffer-data","title":"void onBlockFetchSuccess(String blockId, ManagedBuffer data)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: onBlockFetchSuccess is part of core:BlockFetchingListener.md#onBlockFetchSuccess[BlockFetchingListener Contract].

onBlockFetchSuccess removes the blockId from the outstandingBlocksIds internal registry (if this listener is still the currentListener and the block is still outstanding) and informs the parent BlockFetchingListener of the successful fetch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[RetryingBlockFetchListener-onBlockFetchFailure]] onBlockFetchFailure Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"core/RetryingBlockFetcher/#source-scala_1","title":"[source, scala]","text":""},{"location":"core/RetryingBlockFetcher/#void-onblockfetchfailurestring-blockid-throwable-exception","title":"void onBlockFetchFailure(String blockId, Throwable exception)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: onBlockFetchFailure is part of core:BlockFetchingListener.md#onBlockFetchFailure[BlockFetchingListener Contract].

onBlockFetchFailure either initiates another retry (if this listener is still the currentListener and more retries are allowed) or removes the blockId from the outstandingBlocksIds internal registry and informs the parent BlockFetchingListener of the failure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/","title":"Demos","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The following demos are available:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DiskBlockManager and Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/","title":"Demo: DiskBlockManager and Block Data","text":"

The demo shows how Spark stores data blocks on local disk (using DiskBlockManager and DiskStore, among other services).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#configure-local-directories","title":"Configure Local Directories","text":"

Spark uses the spark.local.dir configuration property to define one or more local directories to store data blocks in.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Start spark-shell with the property set to a directory of your choice (say local-dirs). Use one directory for easier monitoring.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $SPARK_HOME/bin/spark-shell --conf spark.local.dir=local-dirs\n

When started, Spark creates a proper directory layout. You are interested in the blockmgr-[uuid] directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#create-data-blocks","title":"\"Create\" Data Blocks","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Execute the following Spark application that forces persisting (caching) data to disk.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  import org.apache.spark.storage.StorageLevel\nspark.range(2).persist(StorageLevel.DISK_ONLY).count\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#observe-block-files","title":"Observe Block Files","text":""},{"location":"demo/diskblockmanager-and-block-data/#command-line","title":"Command Line","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Go to the blockmgr-[uuid] directory and observe the block files. There should be a few. Do you know how many and why?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ tree local-dirs/blockmgr-b7167b5a-ae8d-404b-8de2-1a0fb101fe00/\nlocal-dirs/blockmgr-b7167b5a-ae8d-404b-8de2-1a0fb101fe00/\n\u251c\u2500\u2500 00\n\u251c\u2500\u2500 04\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_8_0.data\n\u251c\u2500\u2500 06\n\u251c\u2500\u2500 08\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_8_0.index\n...\n\u251c\u2500\u2500 37\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_7_0.index\n\u251c\u2500\u2500 38\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_4_0.data\n\u251c\u2500\u2500 39\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 shuffle_0_9_0.index\n\u2514\u2500\u2500 3a\n    \u2514\u2500\u2500 shuffle_0_6_0.data\n\n47 directories, 48 files\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#diskblockmanager","title":"DiskBlockManager","text":"

The files are managed by DiskBlockManager, which can also be used to access them all.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  import org.apache.spark.SparkEnv\nSparkEnv.get.blockManager.diskBlockManager.getAllFiles()\n
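For example, to relate the files on disk to what DiskBlockManager tracks, you could print every managed file with its size (getAllFiles() returns plain java.io.File objects):

SparkEnv.get.blockManager.diskBlockManager.getAllFiles()
  .foreach(f => println(s"${f.getName}: ${f.length} bytes"))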
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#use-web-ui","title":"Use web UI","text":"

Open http://localhost:4040 and switch to the Storage tab (at http://localhost:4040/storage/). You should see one RDD cached.

Click the link in the RDD Name column and review the information.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"demo/diskblockmanager-and-block-data/#enable-logging","title":"Enable Logging","text":"

Enable ALL logging level for the org.apache.spark.storage.DiskStore and org.apache.spark.storage.DiskBlockManager loggers to get an even deeper insight into the block storage internals.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.storage.DiskBlockManager=ALL\nlog4j.logger.org.apache.spark.storage.DiskStore=ALL\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/","title":"Dynamic Allocation of Executors","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Dynamic Allocation of Executors (Dynamic Resource Allocation or Elastic Scaling) is a Spark service for adding and removing Spark executors dynamically on demand to match workload.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Unlike the \"traditional\" static allocation where a Spark application reserves CPU and memory resources upfront (irrespective of how much it may eventually use), in dynamic allocation you get as much as needed and no more. It scales the number of executors up and down based on workload, i.e. idle executors are removed, and when there are pending tasks waiting for executors to be launched on, dynamic allocation requests them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Dynamic Allocation is enabled (and SparkContext creates an ExecutorAllocationManager) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. spark.dynamicAllocation.enabled configuration property is enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2. spark.master is non-local

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  3. SchedulerBackend is an ExecutorAllocationClient

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationManager is the heart of Dynamic Resource Allocation.

With Dynamic Allocation enabled, it is recommended to also use the External Shuffle Service.
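For example, a spark-shell session with Dynamic Allocation (and the External Shuffle Service) could be started along the following lines; the cluster manager and the executor bounds are illustrative assumptions, not requirements:

$SPARK_HOME/bin/spark-shell \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s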

Dynamic Allocation comes with the following policies for scaling executors up and down:

1. Scale Up Policy requests new executors when there are pending tasks, and increases the number of executors exponentially (since executors start slowly and a Spark application may need slightly more).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2. Scale Down Policy removes executors that have been idle for spark.dynamicAllocation.executorIdleTimeout seconds.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/#performance-metrics","title":"Performance Metrics","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationManagerSource metric source is used to report performance metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/#sparkcontextkillexecutors","title":"SparkContext.killExecutors","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext.killExecutors is unsupported with Dynamic Allocation enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/#programmable-dynamic-allocation","title":"Programmable Dynamic Allocation","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkContext offers a developer API to scale executors up or down.
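A minimal sketch of that developer API (assuming a running SparkContext sc on a non-local master; the executor IDs are made up, and the methods are DeveloperApi so they may change between releases):

// Scale up: ask the cluster manager for two additional executors
sc.requestExecutors(2)

// Ask for an absolute total of four executors
// (arguments: total executors, locality-aware task count, host-to-local-task-count map)
sc.requestTotalExecutors(4, 0, Map.empty)

// Scale down: request removal of specific executors by ID
// (see the note above about killExecutors with Dynamic Allocation enabled)
sc.killExecutors(Seq("1", "2"))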

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/#getting-initial-number-of-executors-for-dynamic-allocation","title":"Getting Initial Number of Executors for Dynamic Allocation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getDynamicAllocationInitialExecutors(conf: SparkConf): Int\n

getDynamicAllocationInitialExecutors first makes sure that spark.dynamicAllocation.initialExecutors is equal to or greater than spark.dynamicAllocation.minExecutors.

NOTE: spark.dynamicAllocation.initialExecutors falls back to spark.dynamicAllocation.minExecutors if not set. Why print the WARN message to the logs then?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If not, you should see the following WARN message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.\n

getDynamicAllocationInitialExecutors also makes sure that executor:Executor.md#spark.executor.instances[spark.executor.instances] is equal to or greater than spark.dynamicAllocation.minExecutors.

NOTE: Both executor:Executor.md#spark.executor.instances[spark.executor.instances] and spark.dynamicAllocation.initialExecutors fall back to 0 when not defined explicitly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If not, you should see the following WARN message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getDynamicAllocationInitialExecutors sets the initial number of executors to be the maximum of:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.dynamicAllocation.minExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.dynamicAllocation.initialExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.executor.instances
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  You should see the following INFO message in the logs:

Using initial executors = [initialExecutors], max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getDynamicAllocationInitialExecutors is used when ExecutorAllocationManager is requested to set the initial number of executors.
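The selection rule can be summarized with the following sketch (a simplified illustration of the rule above, not the actual Spark code):

```scala
import org.apache.spark.SparkConf

// Simplified illustration only; every setting falls back to 0 when not defined explicitly.
def dynamicAllocationInitialExecutors(conf: SparkConf): Int =
  Seq(
    conf.getInt("spark.dynamicAllocation.minExecutors", 0),
    conf.getInt("spark.dynamicAllocation.initialExecutors", 0),
    conf.getInt("spark.executor.instances", 0),
    0).max
```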

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/#resources","title":"Resources","text":""},{"location":"dynamic-allocation/#documentation","title":"Documentation","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Dynamic Allocation in the official documentation of Apache Spark
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Dynamic allocation in the documentation of Cloudera Data Platform (CDP)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/#slides","title":"Slides","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Dynamic Allocation in Spark by Databricks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/ExecutorAllocationClient/","title":"ExecutorAllocationClient","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationClient is an abstraction of schedulers that can communicate with a cluster manager to request or kill executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/ExecutorAllocationClient/#contract","title":"Contract","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#active-executor-ids","title":"Active Executor IDs
getExecutorIds(): Seq[String]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested for active executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#isexecutoractive","title":"isExecutorActive
isExecutorActive(
  id: String): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Whether a given executor (by ID) is active (and can be used to execute tasks)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-executors","title":"Killing Executors
killExecutors(
  executorIds: Seq[String],
  adjustTargetNumExecutors: Boolean,
  countFailures: Boolean,
  force: Boolean = false): Seq[String]

Requests a cluster manager to kill the given executors and returns the IDs of the executors acknowledged by the cluster manager to be removed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationClient is requested to kill an executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManager is requested to removeExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to kill executors and killAndReplaceExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlacklistTracker is requested to kill an executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DriverEndpoint is requested to handle a KillExecutorsOnHost message
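For illustration, a hypothetical invocation could look as follows (the trait is private[spark], so such code only compiles inside the org.apache.spark package; the executor IDs are made up):

```scala
// `client` stands for any of the implementations listed in the Implementations section,
// e.g. a CoarseGrainedSchedulerBackend. The result holds the IDs of the executors
// the cluster manager agreed to remove.
def removeTwoExecutors(client: ExecutorAllocationClient): Seq[String] =
  client.killExecutors(
    executorIds = Seq("1", "2"),    // made-up executor IDs
    adjustTargetNumExecutors = true,
    countFailures = false)          // force keeps its default (false)
```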
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-executors-on-host","title":"Killing Executors on Host
killExecutorsOnHost(
  host: String): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlacklistTracker is requested to kill executors on a blacklisted node
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#requesting-additional-executors","title":"Requesting Additional Executors
requestExecutors(
  numAdditionalExecutors: Int): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Requests additional executors from a cluster manager and returns whether the request has been acknowledged by the cluster manager (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested for additional executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#updating-total-executors","title":"Updating Total Executors
requestTotalExecutors(
  resourceProfileIdToNumExecutors: Map[Int, Int],
  numLocalityAwareTasksPerResourceProfileId: Map[Int, Int],
  hostToLocalTaskCount: Map[Int, Map[String, Int]]): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Updates a cluster manager with the exact number of executors desired. Returns whether the request has been acknowledged by the cluster manager (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to update the number of total executors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManager is requested to start, updateAndSyncNumExecutorsTarget, addExecutors, removeExecutors
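As an illustration only (a private[spark] API; the maps are keyed by resource profile ID, and 0 is assumed to be the default ResourceProfile), a call might look like this:

```scala
// Requests exactly 10 executors for resource profile 0,
// with no locality-aware tasks and no host-to-task-count hints.
def requestTenExecutors(client: ExecutorAllocationClient): Boolean =
  client.requestTotalExecutors(
    resourceProfileIdToNumExecutors = Map(0 -> 10),
    numLocalityAwareTasksPerResourceProfileId = Map(0 -> 0),
    hostToLocalTaskCount = Map(0 -> Map.empty[String, Int]))
```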

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • CoarseGrainedSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MesosCoarseGrainedSchedulerBackend
• StandaloneSchedulerBackend ([Spark Standalone](https://books.japila.pl/spark-standalone-internals/StandaloneSchedulerBackend))
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • YarnSchedulerBackend
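Put together, the contract methods above can be condensed into a Scala trait sketch (signatures copied from the sections above; not the complete Spark source, which defines more members):

```scala
// Condensed sketch of the contract, for orientation only.
trait ExecutorAllocationClient {
  def getExecutorIds(): Seq[String]
  def isExecutorActive(id: String): Boolean
  def killExecutors(
    executorIds: Seq[String],
    adjustTargetNumExecutors: Boolean,
    countFailures: Boolean,
    force: Boolean = false): Seq[String]
  def killExecutorsOnHost(host: String): Boolean
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def requestTotalExecutors(
    resourceProfileIdToNumExecutors: Map[Int, Int],
    numLocalityAwareTasksPerResourceProfileId: Map[Int, Int],
    hostToLocalTaskCount: Map[Int, Map[String, Int]]): Boolean
}
```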
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/ExecutorAllocationClient/#killing-single-executor","title":"Killing Single Executor
killExecutor(
  executorId: String): Boolean

killExecutor kills the given executor.

killExecutor is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManager removes an executor.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to kill executors.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#decommissioning-executors","title":"Decommissioning Executors
decommissionExecutors(
  executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
  adjustTargetNumExecutors: Boolean,
  triggeredByExecutor: Boolean): Seq[String]

decommissionExecutors decommissions the given executors.

decommissionExecutors is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationClient is requested to decommission a single executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManager is requested to remove executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • StandaloneSchedulerBackend (Spark Standalone) is requested to executorDecommissioned
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#decommissioning-single-executor","title":"Decommissioning Single Executor
decommissionExecutor(
  executorId: String,
  decommissionInfo: ExecutorDecommissionInfo,
  adjustTargetNumExecutors: Boolean,
  triggeredByExecutor: Boolean = false): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  decommissionExecutor...FIXME

decommissionExecutor is used when:

• DriverEndpoint is requested to handle an ExecutorDecommissioning message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"dynamic-allocation/ExecutorAllocationListener/","title":"ExecutorAllocationListener","text":"

ExecutorAllocationListener is a SparkListener that intercepts events about stages, tasks, and executors, i.e. onStageSubmitted, onStageCompleted, onTaskStart, onTaskEnd, onExecutorAdded, and onExecutorRemoved. Using these events, ExecutorAllocationManager can manage the pool of dynamically managed executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internal Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationListener is an internal class of ExecutorAllocationManager with full access to internal registries.
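To show which callbacks are involved, the following is a bare-bones SparkListener with the same overrides (an illustrative skeleton with a made-up name, not the actual ExecutorAllocationListener):

```scala
import org.apache.spark.scheduler._

// Skeleton only; the real ExecutorAllocationListener updates
// ExecutorAllocationManager's internal registries in these callbacks.
class WorkloadTrackingListener extends SparkListener {
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = ()
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = ()
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = ()
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = ()
  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = ()
  override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = ()
}
```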

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/ExecutorAllocationManager/","title":"ExecutorAllocationManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationManager can be used to dynamically allocate executors based on processing workload.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationManager intercepts Spark events using the internal ExecutorAllocationListener that keeps track of the workload.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"dynamic-allocation/ExecutorAllocationManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorAllocationManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ContextCleaner (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Clock (default: SystemClock)

ExecutorAllocationManager is created (and started) when SparkContext is created (with Dynamic Allocation of Executors enabled).
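For reference, a minimal configuration sketch that enables Dynamic Allocation (property names as in the official documentation; the application name and values are illustrative only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// With spark.dynamicAllocation.enabled set to true,
// SparkContext creates and starts an ExecutorAllocationManager.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo") // hypothetical application name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.initialExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  .set("spark.shuffle.service.enabled", "true") // required by validateSettings (below)

val sc = SparkContext.getOrCreate(conf)
```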

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"dynamic-allocation/ExecutorAllocationManager/#validating-configuration","title":"Validating Configuration
validateSettings(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    validateSettings makes sure that the settings for dynamic allocation are correct.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    validateSettings throws a SparkException when the following are not met:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.dynamicAllocation.minExecutors must be positive

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.dynamicAllocation.maxExecutors must be 0 or greater

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.dynamicAllocation.minExecutors must be less than or equal to spark.dynamicAllocation.maxExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.dynamicAllocation.executorIdleTimeout must be greater than 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.shuffle.service.enabled must be enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • The number of tasks per core, i.e. spark.executor.cores divided by spark.task.cpus, is not zero.
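The checks can be sketched as follows (a simplified illustration of the rules above; the exact messages, defaults, and config accessors differ in the Spark source):

```scala
import org.apache.spark.{SparkConf, SparkException}

// Simplified illustration only; not the actual Spark code.
def validateSettings(conf: SparkConf): Unit = {
  val minExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
  val maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
  val idleTimeoutS = conf.getTimeAsSeconds("spark.dynamicAllocation.executorIdleTimeout", "60s")
  val tasksPerCore = conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)

  if (minExecutors < 0) throw new SparkException("minExecutors must be positive")
  if (maxExecutors < 0) throw new SparkException("maxExecutors must be 0 or greater")
  if (minExecutors > maxExecutors) throw new SparkException("minExecutors must be <= maxExecutors")
  if (idleTimeoutS <= 0) throw new SparkException("executorIdleTimeout must be > 0")
  if (!conf.getBoolean("spark.shuffle.service.enabled", false))
    throw new SparkException("spark.shuffle.service.enabled must be enabled")
  if (tasksPerCore == 0)
    throw new SparkException("spark.executor.cores must not be less than spark.task.cpus")
}
```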

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#performance-metrics","title":"Performance Metrics","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorAllocationManager uses ExecutorAllocationManagerSource for performance metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"dynamic-allocation/ExecutorAllocationManager/#executormonitor","title":"ExecutorMonitor

ExecutorAllocationManager creates an ExecutorMonitor when created.

ExecutorMonitor is added to the management queue (of LiveListenerBus) when ExecutorAllocationManager is started.

ExecutorMonitor is attached (to the ContextCleaner) when ExecutorAllocationManager is started.

ExecutorMonitor is requested to reset when ExecutorAllocationManager is requested to reset.

ExecutorMonitor is used for the following performance metrics:

• numberExecutorsPendingToRemove (based on pendingRemovalCount)
• numberAllExecutors (based on executorCount)

ExecutorMonitor is used for the following:

• timedOutExecutors when ExecutorAllocationManager is requested to schedule
• executorCount when ExecutorAllocationManager is requested to addExecutors
• executorCount, pendingRemovalCount and executorsKilled when ExecutorAllocationManager is requested to removeExecutors

## ExecutorAllocationListener

ExecutorAllocationManager creates an ExecutorAllocationListener when created to intercept Spark events that impact the allocation policy.

ExecutorAllocationListener is added to the management queue (of LiveListenerBus) when ExecutorAllocationManager is started.

ExecutorAllocationListener is used to calculate the maximum number of executors needed.

## spark.dynamicAllocation.executorAllocationRatio

ExecutorAllocationManager uses the spark.dynamicAllocation.executorAllocationRatio configuration property for maxNumExecutorsNeeded.

## tasksPerExecutorForFullParallelism

ExecutorAllocationManager uses the spark.executor.cores and spark.task.cpus configuration properties for the number of tasks that can be submitted to an executor for full parallelism (see the sketch below).

Used when:

• maxNumExecutorsNeeded
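A minimal sketch of how these configuration properties could be set on a SparkConf. The property names are the real Spark keys referenced above; the values are illustrative only, and the derived tasksPerExecutorForFullParallelism in the comment simply follows the description above.

```scala
import org.apache.spark.SparkConf

// Illustrative values only (not recommendations).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorAllocationRatio", "0.5") // scales down executor demand
  .set("spark.executor.cores", "4") // task slots per executor
  .set("spark.task.cpus", "1")      // CPUs claimed by every task

// With these settings: tasksPerExecutorForFullParallelism = 4 / 1 = 4
```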
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#maximum-number-of-executors-needed","title":"Maximum Number of Executors Needed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxNumExecutorsNeeded(): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxNumExecutorsNeeded requests the ExecutorAllocationListener for the number of pending and running tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxNumExecutorsNeeded is the smallest integer value that is greater than or equal to the multiplication of the total number of pending and running tasks by executorAllocationRatio divided by tasksPerExecutorForFullParallelism.
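A minimal sketch of that arithmetic in plain Scala (the function and parameter names just mirror the description above; this is not Spark's private code):

```scala
// Smallest integer >= (pending + running) * executorAllocationRatio / tasksPerExecutorForFullParallelism
def maxNumExecutorsNeeded(
    pendingTasks: Int,
    runningTasks: Int,
    executorAllocationRatio: Double,
    tasksPerExecutorForFullParallelism: Int): Int = {
  val totalTasks = pendingTasks + runningTasks
  math.ceil(totalTasks * executorAllocationRatio / tasksPerExecutorForFullParallelism).toInt
}

// e.g. 9 pending + 3 running tasks, ratio 1.0, 4 task slots per executor => 3 executors
maxNumExecutorsNeeded(pendingTasks = 9, runningTasks = 3,
  executorAllocationRatio = 1.0, tasksPerExecutorForFullParallelism = 4) // 3
```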

maxNumExecutorsNeeded is used for:

• updateAndSyncNumExecutorsTarget
• the numberMaxNeededExecutors performance metric

## ExecutorAllocationClient

ExecutorAllocationManager is given an ExecutorAllocationClient when created.

## Starting ExecutorAllocationManager

start(): Unit

start requests the LiveListenerBus to add the following to the management queue:

• ExecutorAllocationListener
• ExecutorMonitor

start requests the ContextCleaner (if defined) to attach the ExecutorMonitor.

start creates a scheduleTask (a Java Runnable) to execute schedule.

start requests the ScheduledExecutorService to schedule the scheduleTask every 100 ms.

Note

The schedule delay of 100 ms is not configurable.
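The scheduling step above can be pictured with the sketch below. The names (allocationExecutor, scheduleTask) and the choice of scheduleWithFixedDelay are assumptions for illustration, not Spark's private fields.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Stand-in for ExecutorAllocationManager.schedule()
def schedule(): Unit = println("adjusting the number of executors")

// A single-threaded scheduled executor service runs the task every 100 ms
// (the delay is hard-coded, as noted above).
val allocationExecutor = Executors.newSingleThreadScheduledExecutor()

val scheduleTask: Runnable = new Runnable {
  override def run(): Unit = schedule()
}

allocationExecutor.scheduleWithFixedDelay(scheduleTask, 0, 100, TimeUnit.MILLISECONDS)
```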

start requests the ExecutorAllocationClient to request the total executors with the following:

• numExecutorsTarget
• localityAwareTasks
• hostToLocalTaskCount

start is used when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#scheduling-executors","title":"Scheduling Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    schedule(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    schedule requests the ExecutorMonitor for timedOutExecutors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If there are executors to be removed, schedule turns the initializing internal flag off.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    schedule updateAndSyncNumExecutorsTarget with the current time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, schedule removes the executors to be removed if there are any.
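The control flow reads roughly as follows. This is a self-contained sketch with stubbed collaborators (timedOutExecutors, updateAndSyncNumExecutorsTarget and removeExecutors are placeholders standing in for the ExecutorMonitor and the allocation logic), not Spark's private implementation.

```scala
object ScheduleSketch {
  // Stubs standing in for ExecutorMonitor and the removal/target-sync logic.
  def timedOutExecutors(): Seq[(String, Int)] = Seq.empty // (executorId, resourceProfileId) pairs
  def updateAndSyncNumExecutorsTarget(now: Long): Int = 0
  def removeExecutors(executors: Seq[(String, Int)]): Seq[String] = Seq.empty

  var initializing = true

  def schedule(): Unit = {
    val executorIdsToBeRemoved = timedOutExecutors()
    if (executorIdsToBeRemoved.nonEmpty) {
      initializing = false // having removal candidates means the manager is past initialization
    }
    updateAndSyncNumExecutorsTarget(System.nanoTime())
    if (executorIdsToBeRemoved.nonEmpty) {
      removeExecutors(executorIdsToBeRemoved)
    }
  }
}
```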

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#updateandsyncnumexecutorstarget","title":"updateAndSyncNumExecutorsTarget
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    updateAndSyncNumExecutorsTarget(\n  now: Long): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    updateAndSyncNumExecutorsTarget maxNumExecutorsNeeded.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    updateAndSyncNumExecutorsTarget...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#stopping-executorallocationmanager","title":"Stopping ExecutorAllocationManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop shuts down <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop waits 10 seconds for the termination to be complete.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop is used when SparkContext is requested to stop
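A hedged sketch of that shutdown step, reusing the ScheduledExecutorService pattern from the start sketch above (allocationExecutor is an illustrative name, not Spark's field):

```scala
import java.util.concurrent.{Executors, TimeUnit}

val allocationExecutor = Executors.newSingleThreadScheduledExecutor()

def stop(): Unit = {
  allocationExecutor.shutdown()
  // Wait up to 10 seconds for in-flight scheduling work to finish.
  allocationExecutor.awaitTermination(10, TimeUnit.SECONDS)
}
```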

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#spark-dynamic-executor-allocation-allocation-executor","title":"spark-dynamic-executor-allocation Allocation Executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    spark-dynamic-executor-allocation allocation executor is a...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#executorallocationmanagersource","title":"ExecutorAllocationManagerSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorAllocationManagerSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#removing-executors","title":"Removing Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeExecutors(\n  executors: Seq[(String, Int)]): Seq[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeExecutors...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeExecutors\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorAllocationManager is requested to schedule executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.ExecutorAllocationManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.ExecutorAllocationManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/","title":"ExecutorAllocationManagerSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorAllocationManagerSource is a metric source for Dynamic Allocation of Executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#source-name","title":"Source Name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorAllocationManagerSource is registered under the name ExecutorAllocationManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#gauges","title":"Gauges","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberexecutorstoadd","title":"numberExecutorsToAdd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    executors/numberExecutorsToAdd for numExecutorsToAdd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberexecutorspendingtoremove","title":"numberExecutorsPendingToRemove

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    executors/numberExecutorsPendingToRemove for pendingRemovalCount

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numberallexecutors","title":"numberAllExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    executors/numberAllExecutors for executorCount

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numbertargetexecutors","title":"numberTargetExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    executors/numberTargetExecutors for numExecutorsTarget

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorAllocationManagerSource/#numbermaxneededexecutors","title":"numberMaxNeededExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    executors/numberMaxNeededExecutors for maxNumExecutorsNeeded
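For context, Spark's metrics system is built on Dropwizard Metrics. The sketch below shows how gauges like the ones above could be registered under the executors namespace; the zero values are placeholders for the ExecutorAllocationManager state they would reflect, and registerGauge is an illustrative helper, not Spark's code.

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

val metricRegistry = new MetricRegistry()

// Register a read-only gauge that re-evaluates `value` on every metrics poll.
def registerGauge[T](name: String, value: => T): Unit = {
  metricRegistry.register(
    MetricRegistry.name("executors", name),
    new Gauge[T] { override def getValue: T = value })
}

registerGauge("numberExecutorsToAdd", 0)     // would read numExecutorsToAdd
registerGauge("numberAllExecutors", 0)       // would read executorCount
registerGauge("numberMaxNeededExecutors", 0) // would read maxNumExecutorsNeeded
```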

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/","title":"ExecutorMonitor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorMonitor is a SparkListener and a CleanerListener.
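As a SparkListener it reacts to scheduler events to keep its view of executors current. The following sketch is an illustrative listener (not Spark's ExecutorMonitor code) showing the kind of callbacks such a listener overrides.

```scala
import org.apache.spark.scheduler.{
  SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// An illustrative listener tracking executor lifecycle events.
class ExecutorTrackingListener extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    println(s"tracking executor ${event.executorId}")

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    println(s"executor ${event.executorId} removed: ${event.reason}")
}
```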

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"dynamic-allocation/ExecutorMonitor/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorMonitor takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorAllocationClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Clock

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"dynamic-allocation/ExecutorMonitor/#shuffleids-registry","title":"shuffleIds Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleIds: Set[Int]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor uses a mutable HashSet to track shuffle IDs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleIds is initialized only when shuffleTrackingEnabled is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleIds is used by Tracker internal class for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • updateTimeout, addShuffle, removeShuffle and updateActiveShuffles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executors-registry","title":"Executors Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executors: ConcurrentHashMap[String, Tracker]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor uses a Java ConcurrentHashMap to track available executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      An executor is added when (via ensureExecutorIsTracked):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onBlockUpdated
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onTaskStart

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      An executor is removed when onExecutorRemoved.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      All executors are removed when reset.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executors is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onOtherEvent (cleanupShuffle)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • executorCount
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • executorsKilled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onUnpersistRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onJobEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • pendingRemovalCount
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • timedOutExecutors
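
The following is a minimal, self-contained sketch of this registry pattern, not ExecutorMonitor's actual code: Tracker is reduced to a stand-in class, and the add-if-absent behaviour of ensureExecutorIsTracked is modelled with ConcurrentHashMap.computeIfAbsent.

import java.util.concurrent.ConcurrentHashMap

object ExecutorRegistrySketch {
  // Stand-in for ExecutorMonitor's Tracker (assumption: the real Tracker carries much more state)
  final class Tracker(var resourceProfileId: Int)

  private val executors = new ConcurrentHashMap[String, Tracker]()

  // Returns the Tracker of the given executor, registering a new one if the executor is not tracked yet
  def ensureExecutorIsTracked(id: String, resourceProfileId: Int): Tracker =
    executors.computeIfAbsent(id, _ => new Tracker(resourceProfileId))

  // Removing a single executor (cf. onExecutorRemoved) or all of them (cf. reset)
  def untrack(id: String): Unit = executors.remove(id)
  def clear(): Unit = executors.clear()

  def executorCount: Int = executors.size()
}

computeIfAbsent makes the lookup-or-register step atomic, which is why a concurrent map (rather than a plain mutable.Map with external locking) suits state that is updated by listener callbacks and also read from other threads (e.g. the allocation manager's polling).
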
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#fetchfromshufflesvcenabled-flag","title":"fetchFromShuffleSvcEnabled Flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      fetchFromShuffleSvcEnabled: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor initializes fetchFromShuffleSvcEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.shuffle.service.fetch.rdd.enabled configuration properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      fetchFromShuffleSvcEnabled is enabled (true) when the aforementioned configuration properties are.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      fetchFromShuffleSvcEnabled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onBlockUpdated
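
As a sketch of how such a flag can be derived from a SparkConf (the property names come from the text above; the defaults and the exact lookup inside ExecutorMonitor are assumptions):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.fetch.rdd.enabled", "true")

// true only when both properties are enabled (defaults here are illustrative)
val fetchFromShuffleSvcEnabled: Boolean =
  conf.getBoolean("spark.shuffle.service.enabled", defaultValue = false) &&
    conf.getBoolean("spark.shuffle.service.fetch.rdd.enabled", defaultValue = false)
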
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#shuffletrackingenabled-flag","title":"shuffleTrackingEnabled Flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleTrackingEnabled: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor initializes shuffleTrackingEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.dynamicAllocation.shuffleTracking.enabled configuration properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleTrackingEnabled is enabled (true) when the following holds:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1. spark.shuffle.service.enabled is disabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2. spark.dynamicAllocation.shuffleTracking.enabled is enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When enabled, shuffleTrackingEnabled is used to skip execution of the following (making them noops):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onJobEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When disabled, shuffleTrackingEnabled is used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • shuffleCleaned
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • shuffleIds
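
Following the two conditions above, the flag boils down to a conjunction. A sketch in the same SparkConf style as before (defaults and lookup are assumptions, not ExecutorMonitor's exact code):

import org.apache.spark.SparkConf

val conf = new SparkConf()

// Enabled only without an external shuffle service and with shuffle tracking requested
val shuffleTrackingEnabled: Boolean =
  !conf.getBoolean("spark.shuffle.service.enabled", defaultValue = false) &&
    conf.getBoolean("spark.dynamicAllocation.shuffleTracking.enabled", defaultValue = false)
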
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#sparkdynamicallocationcachedexecutoridletimeout","title":"spark.dynamicAllocation.cachedExecutorIdleTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorMonitor reads spark.dynamicAllocation.cachedExecutorIdleTimeout configuration property for Tracker to updateTimeout.
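
For example, the property can be set on the SparkConf of a Spark application (the value is illustrative and only matters with dynamic allocation enabled):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Executors holding cached blocks are considered idle after 2 minutes (the default is effectively unlimited)
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")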

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onblockupdated","title":"onBlockUpdated
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onBlockUpdated(\n  event: SparkListenerBlockUpdated): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onBlockUpdated\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onBlockUpdated...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onexecutoradded","title":"onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorAdded(\n  event: SparkListenerExecutorAdded): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorAdded\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorAdded...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onexecutorremoved","title":"onExecutorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorRemoved(\n  event: SparkListenerExecutorRemoved): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorRemoved\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onExecutorRemoved...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onjobend","title":"onJobEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobEnd(\n  event: SparkListenerJobEnd): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobEnd\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobEnd...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onjobstart","title":"onJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobStart(\n  event: SparkListenerJobStart): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobStart\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobStart does nothing and simply returns when the shuffleTrackingEnabled flag is turned off (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onJobStart requests the input SparkListenerJobStart for the StageInfos and converts...FIXME
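
The note above describes a plain early-return guard. A sketch of that guard only (the body that records stage and shuffle information is omitted, since it is not covered here):

import org.apache.spark.scheduler.SparkListenerJobStart

class OnJobStartGuardSketch(shuffleTrackingEnabled: Boolean) {
  def onJobStart(event: SparkListenerJobStart): Unit = {
    // Without shuffle tracking there is nothing to record for the job
    if (!shuffleTrackingEnabled) {
      return
    }
    // ... inspect event.stageInfos (the StageInfos of the job) and record shuffle dependencies ...
  }
}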

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onotherevent","title":"onOtherEvent
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onOtherEvent(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onOtherEvent\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onOtherEvent...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#cleanupshuffle","title":"cleanupShuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cleanupShuffle(\n  id: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cleanupShuffle...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cleanupShuffle\u00a0is used when onOtherEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ontaskend","title":"onTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskEnd(\n  event: SparkListenerTaskEnd): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskEnd\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskEnd...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ontaskstart","title":"onTaskStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskStart(\n  event: SparkListenerTaskStart): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskStart\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskStart...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#onunpersistrdd","title":"onUnpersistRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onUnpersistRDD(\n  event: SparkListenerUnpersistRDD): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onUnpersistRDD\u00a0is part of the SparkListenerInterface abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onUnpersistRDD...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#reset","title":"reset
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      reset(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      reset...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      reset\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#shufflecleaned","title":"shuffleCleaned
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleCleaned(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleCleaned\u00a0is part of the CleanerListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      shuffleCleaned...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#timedoutexecutors","title":"timedOutExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      timedOutExecutors(): Seq[String]\ntimedOutExecutors(\n  when: Long): Seq[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      timedOutExecutors...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      timedOutExecutors\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManager is requested to schedule
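
As a hedged sketch of the underlying idea (the allocation manager periodically asks for executors whose idle deadline has passed; IdleTracker and its timeoutAt field are stand-ins, not ExecutorMonitor's real Tracker):

import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable.ArrayBuffer

object TimedOutExecutorsSketch {
  // Stand-in tracker with a hypothetical idle deadline (epoch millis) and a pending-removal marker
  final class IdleTracker(var timeoutAt: Long, var pendingRemoval: Boolean = false)

  // Collects the IDs of executors whose idle deadline has already passed and that are not being removed yet
  def timedOutExecutors(executors: ConcurrentHashMap[String, IdleTracker], now: Long): Seq[String] = {
    val timedOut = ArrayBuffer.empty[String]
    executors.forEach { (id, tracker) =>
      if (!tracker.pendingRemoval && tracker.timeoutAt <= now) timedOut += id
    }
    timedOut.toSeq
  }
}
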
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executorcount","title":"executorCount
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorCount: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorCount...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorCount\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManager is requested to addExecutors and removeExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManagerSource is requested for numberAllExecutors performance metric
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#pendingremovalcount","title":"pendingRemovalCount
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      pendingRemovalCount: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      pendingRemovalCount...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      pendingRemovalCount\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManager is requested to removeExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManagerSource is requested for numberExecutorsPendingToRemove performance metric
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#executorskilled","title":"executorsKilled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorsKilled(\n  ids: Seq[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorsKilled...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorsKilled\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorAllocationManager is requested to removeExecutors
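
A hedged sketch of how executorsKilled and pendingRemovalCount can fit together (the pendingRemoval field and the overall flow are assumptions for illustration; the real ExecutorMonitor tracks more state per executor):

import java.util.concurrent.ConcurrentHashMap

object PendingRemovalSketch {
  // Stand-in tracker with an illustrative pending-removal flag
  final class Tracker(var pendingRemoval: Boolean = false)

  private val executors = new ConcurrentHashMap[String, Tracker]()

  // Marks the given executors as being removed (e.g. after a successful kill request)
  def executorsKilled(ids: Seq[String]): Unit =
    ids.foreach { id =>
      val tracker = executors.get(id)
      if (tracker != null) tracker.pendingRemoval = true
    }

  // Number of executors currently marked as pending removal
  def pendingRemovalCount: Int = {
    var count = 0
    executors.forEach { (id, tracker) => if (tracker.pendingRemoval) count += 1 }
    count
  }
}

This mirrors the usage listed above: removeExecutors can consult pendingRemovalCount before deciding how many executors to remove and then record the ones it actually killed via executorsKilled.
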
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#ensureexecutoristracked","title":"ensureExecutorIsTracked
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ensureExecutorIsTracked(\n  id: String,\n  resourceProfileId: Int): Tracker\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ensureExecutorIsTracked...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ensureExecutorIsTracked\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onBlockUpdated
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • onTaskStart
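ensureExecutorIsTracked is left as a FIXME above; its name and callers suggest a get-or-create lookup of an executor's Tracker keyed by executor ID. A minimal sketch of that pattern, with made-up names (TrackerSketch, ExecutorRegistrySketch) rather than the actual ExecutorMonitor code:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-ins for the real Tracker and the ExecutorMonitor registry.
final class TrackerSketch(val resourceProfileId: Int)

object ExecutorRegistrySketch {
  // executor ID -> per-executor tracking state
  private val executors = TrieMap.empty[String, TrackerSketch]

  // Get-or-create: return the existing Tracker for the executor,
  // or register a new one with the given resource profile ID.
  def ensureExecutorIsTracked(id: String, resourceProfileId: Int): TrackerSketch =
    executors.getOrElseUpdate(id, new TrackerSketch(resourceProfileId))
}
```
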
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#getresourceprofileid","title":"getResourceProfileId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getResourceProfileId(\n  executorId: String): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getResourceProfileId...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getResourceProfileId\u00a0is used for testing only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"dynamic-allocation/Tracker/","title":"Tracker","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Tracker is a private internal class of ExecutorMonitor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"dynamic-allocation/Tracker/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Tracker takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • resourceProfileId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Tracker is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to ensureExecutorIsTracked
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"dynamic-allocation/Tracker/#cachedblocks-internal-registry","title":"cachedBlocks Internal Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cachedBlocks: Map[Int, BitSet]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Tracker uses cachedBlocks internal registry for cached blocks (RDD IDs and partition IDs stored in an executor).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cachedBlocks is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to onBlockUpdated, onUnpersistRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Tracker is requested to updateTimeout
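The registry maps an RDD ID to a bit set of the partition IDs of that RDD cached on the executor. A minimal, illustrative sketch of that bookkeeping (the object and method names, and the use of scala.collection.mutable.BitSet, are assumptions for the example rather than the actual ExecutorMonitor code):

```scala
import scala.collection.mutable

object CachedBlocksSketch {
  // RDD ID -> partition IDs of that RDD cached on this executor
  private val cachedBlocks = mutable.Map.empty[Int, mutable.BitSet]

  // Record that a partition of an RDD is now cached on this executor
  def addBlock(rddId: Int, partitionId: Int): Unit =
    cachedBlocks.getOrElseUpdate(rddId, mutable.BitSet.empty) += partitionId

  // Forget all blocks of an RDD that has been unpersisted
  def removeRDD(rddId: Int): Unit =
    cachedBlocks -= rddId

  // An executor holding cached blocks is subject to the (usually longer)
  // spark.dynamicAllocation.cachedExecutorIdleTimeout rather than the
  // plain executor idle timeout
  def hasCachedBlocks: Boolean = cachedBlocks.nonEmpty

  def main(args: Array[String]): Unit = {
    addBlock(rddId = 3, partitionId = 0)
    addBlock(rddId = 3, partitionId = 5)
    println(hasCachedBlocks) // true
    removeRDD(3)
    println(hasCachedBlocks) // false
  }
}
```
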
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/Tracker/#removeshuffle","title":"removeShuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        removeShuffle(\n  id: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        removeShuffle...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        removeShuffle\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to cleanupShuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/Tracker/#updateactiveshuffles","title":"updateActiveShuffles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateActiveShuffles(\n  ids: Iterable[Int]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateActiveShuffles...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateActiveShuffles\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to onJobStart and onJobEnd
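removeShuffle and updateActiveShuffles are both FIXMEs above, but their signatures and callers point at per-executor bookkeeping of the shuffle IDs an executor still holds data for. A hedged sketch of that idea (names are made up; this is not the actual Tracker code):

```scala
import scala.collection.mutable

object ShuffleTrackingSketch {
  // Shuffle IDs this executor is believed to hold shuffle data for
  private val shuffleIds = mutable.Set.empty[Int]

  // Register a batch of shuffle IDs, e.g. when a job that uses them starts
  def updateActiveShuffles(ids: Iterable[Int]): Unit =
    shuffleIds ++= ids

  // Forget a shuffle once it has been cleaned up
  def removeShuffle(id: Int): Unit =
    shuffleIds -= id

  // While any shuffles are tracked, the executor falls under
  // spark.dynamicAllocation.shuffleTracking.timeout instead of the idle timeout
  def hasActiveShuffles: Boolean = shuffleIds.nonEmpty
}
```
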
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/Tracker/#updaterunningtasks","title":"updateRunningTasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateRunningTasks(\n  delta: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateRunningTasks...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateRunningTasks\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to onTaskStart, onTaskEnd and onExecutorAdded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/Tracker/#updatetimeout","title":"updateTimeout
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateTimeout(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateTimeout...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateTimeout\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is requested to onBlockUpdated and onUnpersistRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Tracker is requested to updateRunningTasks, removeShuffle, updateActiveShuffles
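updateRunningTasks and updateTimeout are FIXMEs above; the general idea they hint at is that a busy executor never expires, while an idle one gets a deadline derived from whichever of the configured timeouts still apply to the state it holds. A schematic sketch under that assumption (the field names and the exact rules are invented for the example, not Spark's actual logic):

```scala
class IdleTrackerSketch(
    idleTimeoutMs: Long,      // from spark.dynamicAllocation.executorIdleTimeout
    cachedTimeoutMs: Long,    // from spark.dynamicAllocation.cachedExecutorIdleTimeout
    shuffleTimeoutMs: Long) { // from spark.dynamicAllocation.shuffleTracking.timeout

  var hasCachedBlocks = false
  var hasActiveShuffles = false
  private var runningTasks = 0
  private var timeoutAt = Long.MaxValue // Long.MaxValue means "never expires"

  // updateRunningTasks: adjust the task counter, then recompute the deadline
  def updateRunningTasks(delta: Int): Unit = {
    runningTasks += delta
    updateTimeout()
  }

  // updateTimeout: a busy executor never expires; an idle one expires after
  // the longest timeout that applies to the state it still holds
  def updateTimeout(): Unit = {
    timeoutAt =
      if (runningTasks > 0) Long.MaxValue
      else {
        val applicable =
          Seq(idleTimeoutMs) ++
            (if (hasCachedBlocks) Seq(cachedTimeoutMs) else Nil) ++
            (if (hasActiveShuffles) Seq(shuffleTimeoutMs) else Nil)
        System.currentTimeMillis() + applicable.max
      }
  }

  def isTimedOut(now: Long): Boolean = now >= timeoutAt
}
```
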
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/","title":"Spark Configuration Properties","text":""},{"location":"dynamic-allocation/configuration-properties/#sparkdynamicallocation","title":"spark.dynamicAllocation","text":""},{"location":"dynamic-allocation/configuration-properties/#cachedexecutoridletimeout","title":"cachedExecutorIdleTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.cachedExecutorIdleTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        How long (in seconds) to keep blocks cached

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: The largest value representable as an Int

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be >= 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDD is requested to localCheckpoint (simply to print out a WARN message)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#enabled","title":"enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enables Dynamic Allocation of Executors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BarrierJobAllocationFailed is requested for ERROR_MESSAGE_RUN_BARRIER_WITH_DYN_ALLOCATION (for reporting purposes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDD is requested to localCheckpoint (for reporting purposes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkSubmitArguments is requested to loadEnvironmentArguments (for validation purposes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Utils is requested to isDynamicAllocationEnabled
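For reference, a minimal way to switch the feature on for a single application using properties listed on this page. The snippet assumes a deployment without an external shuffle service, where shuffle tracking (described further down) is the usual companion setting:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: enable Dynamic Allocation of Executors for one application.
// Without an external shuffle service, shuffle tracking keeps shuffle data
// available after executors are removed.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")

val sc = SparkContext.getOrCreate(conf)
```
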
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#executorallocationratio","title":"executorAllocationRatio

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.executorAllocationRatio

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1.0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Must be between 0 (exclusive) and 1.0 (inclusive)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#executoridletimeout","title":"executorIdleTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.executorIdleTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 60

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#initialexecutors","title":"initialExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.initialExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.dynamicAllocation.minExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#maxexecutors","title":"maxExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.maxExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: Int.MaxValue

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#minexecutors","title":"minExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.minExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#schedulerbacklogtimeout","title":"schedulerBacklogTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.schedulerBacklogTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        (in seconds)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: 1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#shuffletrackingenabled","title":"shuffleTracking.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.shuffleTracking.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExecutorMonitor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#shuffletrackingtimeout","title":"shuffleTracking.timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.shuffleTracking.timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        (in millis)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: The largest value representable as an Int

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"dynamic-allocation/configuration-properties/#sustainedschedulerbacklogtimeout","title":"sustainedSchedulerBacklogTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        spark.dynamicAllocation.sustainedSchedulerBacklogTimeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Default: spark.dynamicAllocation.schedulerBacklogTimeout

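The sizing and backlog properties above are usually tuned together. A sketch of setting them programmatically (the values are arbitrary placeholders, not recommendations):

```scala
import org.apache.spark.SparkConf

// Arbitrary placeholder values for the sizing and backlog knobs described above.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")     // lower bound
  .set("spark.dynamicAllocation.initialExecutors", "4") // starting point
  .set("spark.dynamicAllocation.maxExecutors", "20")    // upper bound
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s")
```
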
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"executor/","title":"Executor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Spark applications start one or more Executors for executing tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        By default (in Static Allocation of Executors) executors run for the entire lifetime of a Spark application (unlike in Dynamic Allocation).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors are managed by ExecutorBackend.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors reports heartbeat and partial metrics for active tasks to the HeartbeatReceiver RPC Endpoint on the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors provide in-memory storage for RDDs that are cached in Spark applications (via BlockManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When started, an executor first registers itself with the driver that establishes a communication channel directly to the driver to accept tasks for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executor offers are described by executor id and the host on which an executor runs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors can run multiple tasks over their lifetime, both in parallel and sequentially, and track running tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors use an Executor task launch worker thread pool for launching tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Executors send metrics (and heartbeats) using the Heartbeat Sender Thread.

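Conceptually, the task launch pool mentioned above is a cached pool of daemon threads named after their role. A rough sketch of that idea with plain java.util.concurrent (the actual Executor builds its pool through Spark's internal thread utilities, so treat the details here as assumptions):

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object TaskLaunchPoolSketch {
  private val threadCount = new AtomicInteger(0)

  // Daemon threads named like the "Executor task launch worker" pool above
  private val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"Executor task launch worker-${threadCount.getAndIncrement()}")
      t.setDaemon(true) // never block JVM shutdown
      t
    }
  }

  // A cached pool grows and shrinks with the number of concurrently running tasks
  val pool = Executors.newCachedThreadPool(factory)

  def main(args: Array[String]): Unit = {
    pool.execute(() => println(s"running a task on ${Thread.currentThread().getName}"))
    pool.shutdown()
  }
}
```
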
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"executor/CoarseGrainedExecutorBackend/","title":"CoarseGrainedExecutorBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend is an ExecutorBackend that controls the lifecycle of a single executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend is an IsolatedThreadSafeRpcEndpoint that connects to the driver (before accepting messages) and shuts down when the driver disconnects.

CoarseGrainedExecutorBackend can receive the following messages (see the dispatch sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DecommissionExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KillTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • LaunchTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RegisteredExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Shutdown
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StopExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • UpdateDelegationTokens
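
To illustrate how an RPC endpoint dispatches on these message types, here is a hedged, self-contained sketch with stand-in case classes (not Spark's actual CoarseGrainedClusterMessages or receive implementation):

// Stand-in messages; the real ones are defined in CoarseGrainedClusterMessages
sealed trait BackendMessage
case object RegisteredExecutor extends BackendMessage
final case class LaunchTask(serializedTask: Array[Byte]) extends BackendMessage
final case class KillTask(taskId: Long, interruptThread: Boolean) extends BackendMessage
case object DecommissionExecutor extends BackendMessage
case object StopExecutor extends BackendMessage
final case class Shutdown(exitCode: Int) extends BackendMessage
final case class UpdateDelegationTokens(tokens: Array[Byte]) extends BackendMessage

// A receive-like handler: one case per supported message
def receive(message: BackendMessage): Unit = message match {
  case RegisteredExecutor            => println("create the single managed Executor")
  case LaunchTask(bytes)             => println(s"deserialize ${bytes.length} bytes and launch the task")
  case KillTask(id, interrupt)       => println(s"kill task $id (interrupt=$interrupt)")
  case DecommissionExecutor          => println("decommission this executor")
  case StopExecutor                  => println("schedule a Shutdown message to self")
  case Shutdown(code)                => println(s"stop the Executor and exit with code $code")
  case UpdateDelegationTokens(bytes) => println(s"update ${bytes.length} bytes of delegation tokens")
}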

When launched, CoarseGrainedExecutorBackend immediately connects to the parent CoarseGrainedSchedulerBackend (to inform it that it is ready to launch tasks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend registers the Executor RPC endpoint to communicate with the driver (with DriverEndpoint).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend sends regular executor status updates to the driver (to keep the Spark scheduler updated on the number of CPU cores free for task scheduling).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend is started in a resource container (as a standalone application).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"executor/CoarseGrainedExecutorBackend/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedExecutorBackend takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Driver URL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Bind Address (unused)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Hostname
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Number of CPU cores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Resources Configuration File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ResourceProfile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

driverUrl, executorId, hostname, cores and userClassPath correspond to the command-line arguments of the CoarseGrainedExecutorBackend standalone application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CoarseGrainedExecutorBackend is created upon launching CoarseGrainedExecutorBackend standalone application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#executor","title":"Executor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CoarseGrainedExecutorBackend manages the lifecycle of a single Executor:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • An Executor is created upon receiving a RegisteredExecutor message
• The Executor is stopped upon receiving a Shutdown message (this happens on a separate CoarseGrainedExecutorBackend-stop-executor thread)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The Executor is used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • decommissionSelf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Launching a task (upon receiving a LaunchTask message)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Killing a task (upon receiving a KillTask message)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Reporting the number of CPU cores used for a given task in statusUpdate
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#statusUpdate","title":"Reporting Task Status","text":"ExecutorBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          statusUpdate(\n  taskId: Long,\n  state: TaskState,\n  data: ByteBuffer): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          statusUpdate is part of the ExecutorBackend abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          statusUpdate...FIXME
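
Although the details are elided above (FIXME), conceptually statusUpdate wraps the state change of a task into a message and forwards it to the driver endpoint, dropping it when no driver is registered yet. A hedged sketch of that idea with stand-in types (not Spark's actual StatusUpdate or RpcEndpointRef):

import java.nio.ByteBuffer

// Stand-in types to keep the sketch self-contained
sealed trait TaskState
case object RUNNING extends TaskState
case object FINISHED extends TaskState
final case class StatusUpdate(executorId: String, taskId: Long, state: TaskState, data: ByteBuffer)

// Stand-in driver reference that accepts messages
trait DriverRef { def send(message: Any): Unit }

def statusUpdate(driver: Option[DriverRef], executorId: String,
                 taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(ref) => ref.send(msg)  // forward the task's state to the driver
    case None      => println(s"Drop $msg because no driver is registered yet")
  }
}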

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#onStart","title":"Starting Up","text":"RpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          onStart(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          onStart is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With spark.decommission.enabled enabled, onStart...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          onStart prints out the following INFO message to the logs (with the driverUrl):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Connecting to driver: [driverUrl]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          onStart builds a transport-related configuration for shuffle module.

onStart parses or finds resources (parseOrFindResources) using the given resourcesFileOpt, if defined, and initializes the _resources internal registry (of ResourceInformations).

onStart requests the RpcEnv to asyncSetupEndpointRefByURI with the given driverUrl.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If successful, onStart initializes the driver internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          onStart makes this CoarseGrainedExecutorBackend available to other Spark services using the executorBackend registry.

onStart sends a blocking RegisterExecutor message to the driver. If successful, onStart sends a RegisteredExecutor message (to itself).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In case of any failure, onStart terminates this CoarseGrainedExecutorBackend with the error code 1 and the following reason (with no notification to the driver):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Cannot register with driver: [driverUrl]\n
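
Put together, the registration handshake of onStart can be summarized with the following hedged sketch (stand-in types and simplified error handling; the real code goes through RpcEnv, RpcEndpointRef and Scala Futures):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Stand-in for the resolved driver endpoint reference
trait DriverEndpoint {
  // blocking RegisterExecutor-style ask; true means the driver accepted the executor
  def registerExecutor(executorId: String, hostname: String, cores: Int): Boolean
}

def startUp(resolveDriver: Future[DriverEndpoint],
            notifySelfRegistered: () => Unit,
            exitExecutor: String => Unit): Unit =
  resolveDriver.onComplete {
    case Success(driver) if driver.registerExecutor("0", "localhost", 2) =>
      notifySelfRegistered()  // becomes the RegisteredExecutor message sent to self
    case Success(_) =>
      exitExecutor("Driver rejected the registration")
    case Failure(e) =>
      exitExecutor(s"Cannot register with driver: ${e.getMessage}")
  }
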
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#messages","title":"Messages","text":""},{"location":"executor/CoarseGrainedExecutorBackend/#DecommissionExecutor","title":"DecommissionExecutor","text":"

DecommissionExecutor is sent out when CoarseGrainedSchedulerBackend is requested to decommissionExecutors.

When received, CoarseGrainedExecutorBackend decommissions itself (decommissionSelf).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#RegisteredExecutor","title":"RegisteredExecutor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          When received, CoarseGrainedExecutorBackend prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Successfully registered with driver\n

CoarseGrainedExecutorBackend initializes the single managed Executor (with the given executorId and the hostname) and sends a LaunchedExecutor message back to the driver.

RegisteredExecutor is sent out when CoarseGrainedExecutorBackend has finished onStart successfully (and registered with the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/CoarseGrainedExecutorBackend/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.executor.CoarseGrainedExecutorBackend logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          logger.CoarseGrainedExecutorBackend.name = org.apache.spark.executor.CoarseGrainedExecutorBackend\nlogger.CoarseGrainedExecutorBackend.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"executor/Executor/","title":"Executor","text":""},{"location":"executor/Executor/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Executor takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Host name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • User-defined jars
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • isLocal flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UncaughtExceptionHandler (default: SparkUncaughtExceptionHandler)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Resources (Map[String, ResourceInformation])

Executor is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CoarseGrainedExecutorBackend is requested to handle a RegisteredExecutor message (after having registered with the driver)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalEndpoint is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"executor/Executor/#when-created","title":"When Created","text":"

When created, Executor prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Starting executor ID [executorId] on host [executorHostname]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            (only for non-local modes) Executor sets SparkUncaughtExceptionHandler as the default handler invoked when a thread abruptly terminates due to an uncaught exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            (only for non-local modes) Executor requests the BlockManager to initialize (with the Spark application id of the SparkConf).

(only for non-local modes) Executor requests the MetricsSystem to register the following metric sources (see the registration sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • JVMCPUSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorMetricsSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleMetricsSource (of the BlockManager)
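
A minimal sketch of that registration step, assuming a stand-in Source trait and metrics system (the names are illustrative; Spark's own MetricsSystem and Source abstractions are richer):

// Stand-in abstractions, just for the sketch
trait Source { def sourceName: String }
final class MetricsSystemSketch {
  private var sources = Vector.empty[Source]
  def registerSource(source: Source): Unit = {
    sources :+= source
    println(s"registered metric source: ${source.sourceName}")
  }
}

// Register the executor-side sources listed above
val metricsSystem = new MetricsSystemSketch
Seq(
  new Source { val sourceName = "ExecutorSource" },
  new Source { val sourceName = "JVMCPUSource" },
  new Source { val sourceName = "ExecutorMetricsSource" },
  new Source { val sourceName = "ShuffleMetricsSource" }
).foreach(metricsSystem.registerSource)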

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor uses SparkEnv to access the MetricsSystem and BlockManager.

Executor creates a task class loader (optionally with REPL support) and requests the system Serializer to use it as the default class loader (for deserializing tasks).
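
The point of this wiring is that classes from user-defined jars are visible when tasks are deserialized. A hedged sketch of the idea with a stand-in serializer (Spark's actual pieces are a MutableURLClassLoader and the SparkEnv Serializer; the jar path is hypothetical):

import java.net.{URL, URLClassLoader}

// Stand-in for a serializer that accepts a default class loader
final class SerializerSketch {
  @volatile private var defaultLoader: ClassLoader = Thread.currentThread().getContextClassLoader
  def setDefaultClassLoader(loader: ClassLoader): Unit = defaultLoader = loader
  def currentLoader: ClassLoader = defaultLoader
}

// A task class loader over user-defined jars (the jar path is hypothetical)
val userJars = Array(new URL("file:/tmp/user-app.jar"))
val taskClassLoader = new URLClassLoader(userJars, Thread.currentThread().getContextClassLoader)

// Make it the default loader used when deserializing tasks
val serializer = new SerializerSketch
serializer.setDefaultClassLoader(taskClassLoader)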

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor starts sending heartbeats with the metrics of active tasks.
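
Heartbeating boils down to a periodic job on a dedicated scheduler thread. A minimal JDK-only sketch (the thread name and 10-second interval are illustrative; Spark drives the interval via spark.executor.heartbeatInterval):

import java.util.concurrent.{Executors, TimeUnit}

// One dedicated daemon thread, akin to the Heartbeat Sender Thread
val heartbeater = Executors.newSingleThreadScheduledExecutor { (r: Runnable) =>
  val t = new Thread(r, "executor-heartbeater")
  t.setDaemon(true)
  t
}

// Periodically collect metrics of active tasks and report them to the driver
heartbeater.scheduleAtFixedRate(
  () => println("sending heartbeat with metrics of active tasks"),
  0L, 10L, TimeUnit.SECONDS)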

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"executor/Executor/#plugincontainer","title":"PluginContainer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor creates a PluginContainer (with the SparkEnv and the resources).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The PluginContainer is used to create a TaskRunner for launching a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The PluginContainer is requested to shutdown in stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#executorsource","title":"ExecutorSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When created, Executor creates an ExecutorSource (with the threadPool, the executorId and the schemes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The ExecutorSource is then registered with the application's MetricsSystem (in local and non-local modes) to report metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The metrics are updated right after a TaskRunner has finished executing a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#executormetricssource","title":"ExecutorMetricsSource

Executor creates an ExecutorMetricsSource when created with spark.metrics.executorMetricsSource.enabled enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor uses the ExecutorMetricsSource to create the ExecutorMetricsPoller.

When created with the isLocal flag disabled, Executor requests the ExecutorMetricsSource to register itself immediately.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#executormetricspoller","title":"ExecutorMetricsPoller

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor creates an ExecutorMetricsPoller when created with the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryManager of the SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • spark.executor.metrics.pollingInterval
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorMetricsSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor requests the ExecutorMetricsPoller to start immediately when created and to stop when requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskRunner requests the ExecutorMetricsPoller to onTaskStart and onTaskCompletion at the beginning and the end of run, respectively.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When requested to reportHeartBeat with pollOnHeartbeat enabled, Executor requests the ExecutorMetricsPoller to poll.
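
The poller lifecycle above can be illustrated with a script-style Scala sketch. ExecutorMetricsPollerStub is a hypothetical stand-in (not Spark's ExecutorMetricsPoller); the calls simply mirror the steps described in this section:

// Hypothetical stand-in for Spark's ExecutorMetricsPoller (illustration only)\nclass ExecutorMetricsPollerStub(pollingIntervalMs: Long) {\n  def start(): Unit = println(s\"polling every $pollingIntervalMs ms\")\n  def stop(): Unit = println(\"poller stopped\")\n  def onTaskStart(taskId: Long): Unit = println(s\"task $taskId started\")\n  def onTaskCompletion(taskId: Long): Unit = println(s\"task $taskId completed\")\n  def poll(): Unit = println(\"polled executor metric peaks\")\n}\n\nval poller = new ExecutorMetricsPollerStub(pollingIntervalMs = 10000) // spark.executor.metrics.pollingInterval\npoller.start()              // started immediately when Executor is created\npoller.onTaskStart(1L)      // at the beginning of TaskRunner.run\npoller.onTaskCompletion(1L) // at the end of TaskRunner.run\npoller.poll()               // on reportHeartBeat when pollOnHeartbeat is enabled\npoller.stop()               // when Executor is requested to stop\n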

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#fetching-file-and-jar-dependencies","title":"Fetching File and Jar Dependencies
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateDependencies(\n  newFiles: Map[String, Long],\n  newJars: Map[String, Long]): Unit\n

updateDependencies fetches missing or outdated extra files (in the given newFiles). For every name-timestamp pair that denotes a missing or outdated file, updateDependencies prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Fetching [name] with timestamp [timestamp]\n

updateDependencies fetches missing or outdated extra jars (in the given newJars). For every name-timestamp pair that denotes a missing or outdated jar, updateDependencies prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Fetching [name] with timestamp [timestamp]\n

updateDependencies fetches each such file to the SparkFiles root directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateDependencies...FIXME
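
The fetch loop might be sketched as follows (script-style Scala; currentFiles and fetchFile are simplified stand-ins for the executor's internal registry and the actual download logic, and the timestamp comparison is an assumption about what makes a file outdated):

import scala.collection.mutable\n\n// Name-to-timestamp registry of files already fetched (simplified stand-in for the executor's internal state)\nval currentFiles = mutable.Map[String, Long](\"lib.jar\" -> 1L)\n\ndef fetchFile(name: String, timestamp: Long): Unit =\n  println(s\"Fetching $name with timestamp $timestamp\") // download into the SparkFiles root directory\n\ndef updateDependencies(newFiles: Map[String, Long]): Unit =\n  for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {\n    fetchFile(name, timestamp)\n    currentFiles(name) = timestamp // remember the fetched version\n  }\n\nupdateDependencies(Map(\"lib.jar\" -> 2L, \"data.txt\" -> 1L)) // fetches both: one outdated, one missing\n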

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateDependencies is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to start (and run a task)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#sparkdrivermaxresultsize","title":"spark.driver.maxResultSize

Executor uses spark.driver.maxResultSize for TaskRunner when requested to run a task (to decide whether a serialized task result can be sent back to the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#maximum-size-of-direct-results","title":"Maximum Size of Direct Results

Executor uses the minimum of spark.task.maxDirectResultSize and spark.rpc.message.maxSize when TaskRunner is requested to run a task (to decide on the type of a serialized task result).
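
A hedged, script-style Scala sketch of how such limits could be combined, assuming the usual split between results that are dropped, sent indirectly, or sent directly (the sizes and the describeResult helper are illustrative, not Spark's code):

// Illustrative sizes in bytes; in Spark they come from the configuration properties above\nval maxDirectResultSize = 1L << 20   // spark.task.maxDirectResultSize\nval rpcMessageMaxSize   = 128L << 20 // spark.rpc.message.maxSize\nval maxResultSize       = 1L << 30   // spark.driver.maxResultSize\n\nval directLimit = math.min(maxDirectResultSize, rpcMessageMaxSize)\n\ndef describeResult(serializedSize: Long): String =\n  if (serializedSize > maxResultSize) \"result dropped (exceeds spark.driver.maxResultSize)\"\n  else if (serializedSize > directLimit) \"indirect result (sent via the block manager, not inline)\"\n  else \"direct result (sent inline with the status update)\"\n\nprintln(describeResult(512L << 10)) // direct\nprintln(describeResult(64L << 20))  // indirect\nprintln(describeResult(2L << 30))   // dropped\n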

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#islocal-flag","title":"isLocal Flag

Executor is given the isLocal flag when created to indicate whether it runs in local or cluster mode (i.e., whether the Spark application uses a local or a cluster-specific master URL).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isLocal is disabled (false) by default and is off explicitly when CoarseGrainedExecutorBackend is requested to handle a RegisteredExecutor message.

isLocal is enabled (true) when LocalEndpoint is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#sparkexecutoruserclasspathfirst","title":"spark.executor.userClassPathFirst

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor reads the value of the spark.executor.userClassPathFirst configuration property when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When enabled, Executor uses ChildFirstURLClassLoader (not MutableURLClassLoader) when requested to createClassLoader (and addReplClassLoaderIfNeeded).
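
To illustrate what child-first class loading means, here is a simplified Scala sketch of a child-first URLClassLoader and the selection between the two strategies. ChildFirstLoader is a minimal illustration of the idea, not Spark's ChildFirstURLClassLoader:

import java.net.{URL, URLClassLoader}\n\n// Minimal child-first class loader: try this loader's URLs before delegating to the parent\nclass ChildFirstLoader(urls: Array[URL], parent: ClassLoader) extends URLClassLoader(urls, parent) {\n  override def loadClass(name: String, resolve: Boolean): Class[_] = {\n    val c = Option(findLoadedClass(name)).getOrElse {\n      try findClass(name) // child (user) classpath first\n      catch { case _: ClassNotFoundException => super.loadClass(name, resolve) } // then the parent\n    }\n    if (resolve) resolveClass(c)\n    c\n  }\n}\n\nval userJars = Array.empty[URL]\nval userClassPathFirst = true // spark.executor.userClassPathFirst\nval classLoader: ClassLoader =\n  if (userClassPathFirst) new ChildFirstLoader(userJars, getClass.getClassLoader)\n  else new URLClassLoader(userJars, getClass.getClassLoader) // parent-first delegation (the default)\n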

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#user-defined-jars","title":"User-Defined Jars

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor is given user-defined jars when created. No jars are assumed by default.

The jars are specified using the spark.executor.extraClassPath configuration property (via the --user-class-path command-line option of CoarseGrainedExecutorBackend).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#running-tasks-registry","title":"Running Tasks Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runningTasks: Map[Long, TaskRunner]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor tracks TaskRunners by task IDs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#heartbeatreceiver-rpc-endpoint-reference","title":"HeartbeatReceiver RPC Endpoint Reference

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When created, Executor creates an RPC endpoint reference to HeartbeatReceiver (running on the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor uses the RPC endpoint reference when requested to reportHeartBeat.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#launching-task","title":"Launching Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            launchTask(\n  context: ExecutorBackend,\n  taskDescription: TaskDescription): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            launchTask creates a TaskRunner (with the given ExecutorBackend, the TaskDescription and the PluginContainer) and adds it to the runningTasks internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            launchTask requests the \"Executor task launch worker\" thread pool to execute the TaskRunner (sometime in the future).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In case the decommissioned flag is enabled, launchTask prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Launching a task while in decommissioned state.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            launchTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CoarseGrainedExecutorBackend is requested to handle a LaunchTask message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalEndpoint RPC endpoint (of LocalSchedulerBackend) is requested to reviveOffers
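
The launch flow above can be sketched in script-style Scala with a plain Runnable and java.util.concurrent standing in for TaskRunner and the internal thread pool (TaskRunnerStub and the names below are illustrative, not Spark's API):

import java.util.concurrent.{ConcurrentHashMap, Executors}\n\n// Simplified stand-ins for TaskRunner and the runningTasks registry\nfinal class TaskRunnerStub(val taskId: Long) extends Runnable {\n  def run(): Unit = println(s\"running task $taskId\")\n}\nval runningTasks = new ConcurrentHashMap[Long, TaskRunnerStub]\nval threadPool = Executors.newCachedThreadPool() // stand-in for the \"Executor task launch worker\" pool\nvar decommissioned = false\n\ndef launchTask(taskId: Long): Unit = {\n  val runner = new TaskRunnerStub(taskId)\n  runningTasks.put(taskId, runner) // register the runner before execution\n  threadPool.execute(runner)       // run sometime in the future\n  if (decommissioned) println(\"ERROR Launching a task while in decommissioned state.\")\n}\n\nlaunchTask(0L)\nthreadPool.shutdown()\n
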
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#sending-heartbeats-and-active-tasks-metrics","title":"Sending Heartbeats and Active Tasks Metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executors keep sending metrics for active tasks to the driver every spark.executor.heartbeatInterval (defaults to 10s with some random initial delay so the heartbeats from different executors do not pile up on the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            An executor sends heartbeats using the Heartbeat Sender Thread.

For every TaskRunner (in the runningTasks internal registry), the task's metrics are computed and become part of the heartbeat (with accumulators).

A blocking Heartbeat message that holds the executor id, all accumulator updates (per task id), and the BlockManagerId is sent to the HeartbeatReceiver RPC endpoint.

If the response requests the BlockManager to re-register, Executor prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Told to re-register on heartbeat\n

The BlockManager is then requested to reregister.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The internal heartbeatFailures counter is reset.

If there are any issues communicating with the driver, Executor prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Issue communicating with driver in heartbeater\n

The internal heartbeatFailures counter is incremented and checked against spark.executor.heartbeat.maxFailures. If the number of failures reaches the maximum, the following ERROR message is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The executor exits (using System.exit and exit code 56).
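
The failure handling described above might look as follows in a script-style Scala sketch (sendHeartbeat is a stand-in for the actual RPC call, the maximum is an illustrative value, and the exit is only shown as a comment):

import scala.util.control.NonFatal\n\nval HEARTBEAT_MAX_FAILURES = 60 // spark.executor.heartbeat.maxFailures (illustrative value)\nvar heartbeatFailures = 0\n\ndef sendHeartbeat(): Unit = throw new RuntimeException(\"driver unreachable\") // stand-in for the RPC call\n\ndef reportHeartBeat(): Unit =\n  try {\n    sendHeartbeat()\n    heartbeatFailures = 0 // reset on success\n  } catch {\n    case NonFatal(e) =>\n      println(s\"WARN Issue communicating with driver in heartbeater: $e\")\n      heartbeatFailures += 1\n      if (heartbeatFailures >= HEARTBEAT_MAX_FAILURES) {\n        println(s\"ERROR Exit as unable to send heartbeats to driver more than $HEARTBEAT_MAX_FAILURES times\")\n        // the real executor calls System.exit(56) here\n      }\n  }\n\nreportHeartBeat()\n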

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#heartbeat-sender-thread","title":"Heartbeat Sender Thread

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            heartbeater is a ScheduledThreadPoolExecutor (Java) with a single thread.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The name of the thread pool is driver-heartbeater.
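
A minimal Scala sketch of such a single-threaded, named scheduler using java.util.concurrent directly (Spark uses its own thread utilities; the interval and initial delay below are illustrative):

import java.util.concurrent.{ScheduledThreadPoolExecutor, ThreadFactory, TimeUnit}\n\n// Single daemon thread named \"driver-heartbeater\"\nval heartbeaterFactory = new ThreadFactory {\n  def newThread(r: Runnable): Thread = {\n    val t = new Thread(r, \"driver-heartbeater\")\n    t.setDaemon(true)\n    t\n  }\n}\nval heartbeater = new ScheduledThreadPoolExecutor(1, heartbeaterFactory)\n\nval intervalMs = 10000L // spark.executor.heartbeatInterval\nval initialDelayMs = intervalMs + (math.random() * intervalMs).toLong // randomized initial delay\nheartbeater.scheduleAtFixedRate(() => println(\"reportHeartBeat()\"), initialDelayMs, intervalMs, TimeUnit.MILLISECONDS)\n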

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#executor-task-launch-worker-thread-pool","title":"Executor task launch worker Thread Pool

When created, Executor creates threadPool, a daemon cached thread pool with the name Executor task launch worker-[ID] (with ID being the task id).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The threadPool thread pool is used for launching tasks.
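
A hedged Scala sketch of a daemon cached thread pool with named threads, using java.util.concurrent directly rather than Spark's thread utilities:

import java.util.concurrent.{Executors, ThreadFactory}\nimport java.util.concurrent.atomic.AtomicInteger\n\n// Daemon cached thread pool with threads named \"Executor task launch worker-[ID]\"\nval workerId = new AtomicInteger(0)\nval workerFactory = new ThreadFactory {\n  def newThread(r: Runnable): Thread = {\n    val t = new Thread(r, s\"Executor task launch worker-${workerId.getAndIncrement()}\")\n    t.setDaemon(true)\n    t\n  }\n}\nval taskLaunchPool = Executors.newCachedThreadPool(workerFactory)\n\ntaskLaunchPool.execute(() => println(Thread.currentThread().getName)) // e.g. Executor task launch worker-0\n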

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#executor-memory","title":"Executor Memory

The amount of memory per executor is configured using the spark.executor.memory configuration property. It is the same for all executors of a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You can find the value displayed as Memory per Node in the web UI of the standalone Master.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#heartbeating-with-partial-metrics-for-active-tasks-to-driver","title":"Heartbeating With Partial Metrics For Active Tasks To Driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reportHeartBeat(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reportHeartBeat collects TaskRunners for currently running tasks (active tasks) with their tasks deserialized (i.e. either ready for execution or already started).

A TaskRunner has its task deserialized when it runs the task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For every running task, reportHeartBeat takes the TaskMetrics and:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Requests ShuffleRead metrics to be merged
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Sets jvmGCTime metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reportHeartBeat then records the latest values of internal and external accumulators for every task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internal accumulators are a task's metrics while external accumulators are a Spark application's accumulators that a user has created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reportHeartBeat sends a blocking Heartbeat message to the HeartbeatReceiver (on the driver). reportHeartBeat uses the value of spark.executor.heartbeatInterval configuration property for the RPC timeout.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A Heartbeat message contains the executor identifier, the accumulator updates, and the identifier of the BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the response (from HeartbeatReceiver) is to re-register the BlockManager, reportHeartBeat prints out the following INFO message to the logs and requests the BlockManager to re-register (which will register the blocks the BlockManager manages with the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Told to re-register on heartbeat\n

HeartbeatResponse requests the BlockManager to re-register when either the TaskScheduler or the HeartbeatReceiver knows nothing about the executor.

When posting the Heartbeat is successful, reportHeartBeat resets the heartbeatFailures internal counter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In case of a non-fatal exception, you should see the following WARN message in the logs (followed by the stack trace).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Issue communicating with driver in heartbeater\n

On every failure, reportHeartBeat increments the number of heartbeat failures up to the spark.executor.heartbeat.maxFailures configuration property. When the number of heartbeat failures reaches the maximum, reportHeartBeat prints out the following ERROR message to the logs and the executor terminates with error code 56.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reportHeartBeat is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor is requested to schedule reporting heartbeat and partial metrics for active tasks to the driver (that happens every spark.executor.heartbeatInterval).
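
Putting the steps above together, a simplified script-style Scala sketch of collecting per-task accumulator updates and sending a blocking heartbeat (RunningTask, Heartbeat, HeartbeatResponse and askDriver are illustrative stand-ins, not Spark's classes):

// Illustrative stand-ins for running tasks, accumulator updates and the Heartbeat message\nfinal case class RunningTask(taskId: Long, accumulatorUpdates: Seq[(String, Long)])\nfinal case class Heartbeat(executorId: String, accumUpdates: Seq[(Long, Seq[(String, Long)])], blockManagerId: String)\nfinal case class HeartbeatResponse(reregisterBlockManager: Boolean)\n\nval runningTasks = Seq(\n  RunningTask(0L, Seq(\"jvmGCTime\" -> 12L)),\n  RunningTask(1L, Seq(\"shuffleRead.recordsRead\" -> 1000L)))\n\n// Stand-in for the blocking RPC ask to the HeartbeatReceiver on the driver\ndef askDriver(heartbeat: Heartbeat): HeartbeatResponse = HeartbeatResponse(reregisterBlockManager = false)\n\ndef reportHeartBeat(): Unit = {\n  val accumUpdates = runningTasks.map(task => task.taskId -> task.accumulatorUpdates)\n  val response = askDriver(Heartbeat(\"exec-1\", accumUpdates, \"BlockManagerId(exec-1)\"))\n  if (response.reregisterBlockManager) println(\"INFO Told to re-register on heartbeat\")\n}\n\nreportHeartBeat()\n
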
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#sparkexecutorheartbeatmaxfailures","title":"spark.executor.heartbeat.maxFailures

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor uses spark.executor.heartbeat.maxFailures configuration property in reportHeartBeat.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/Executor/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.executor.Executor logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.executor.Executor=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"executor/ExecutorBackend/","title":"ExecutorBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorBackend is an abstraction of executor backends (that TaskRunners use to report task status updates to a scheduler).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorBackend acts as a bridge between executors and the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"executor/ExecutorBackend/#contract","title":"Contract","text":""},{"location":"executor/ExecutorBackend/#statusUpdate","title":"Reporting Task Status","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            statusUpdate(\n  taskId: Long,\n  state: TaskState,\n  data: ByteBuffer): Unit\n

Reports the status of the given task to a scheduler.

See:

• CoarseGrainedExecutorBackend

Used when:

• TaskRunner is requested to run a task
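For illustration only: ExecutorBackend is a private[spark] trait, so the snippet below (with a hypothetical LoggingExecutorBackend that merely prints what a TaskRunner reports) is a shape-only sketch rather than something to compile against the public API.

package org.apache.spark.executor

import java.nio.ByteBuffer

import org.apache.spark.TaskState.TaskState

// Hypothetical backend that only logs status updates instead of forwarding them to a scheduler
class LoggingExecutorBackend extends ExecutorBackend {
  override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
    println(s"Task $taskId is now $state (${data.remaining()} bytes of task result data)")
  }
}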
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"executor/ExecutorBackend/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CoarseGrainedExecutorBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MesosExecutorBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"executor/ExecutorLogUrlHandler/","title":"ExecutorLogUrlHandler","text":""},{"location":"executor/ExecutorLogUrlHandler/#creating-instance","title":"Creating Instance","text":"

ExecutorLogUrlHandler takes the following to be created:

• Optional Log URL Pattern

ExecutorLogUrlHandler is created for the following:

• DriverEndpoint
• HistoryAppStatusStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"executor/ExecutorLogUrlHandler/#applying-pattern","title":"Applying Pattern
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              applyPattern(\n  logUrls: Map[String, String],\n  attributes: Map[String, String]): Map[String, String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              applyPattern doApplyPattern for logUrlPattern defined or simply returns the given logUrls back.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              applyPattern\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DriverEndpoint is requested to handle a RegisterExecutor message (and creates a ExecutorData)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • HistoryAppStatusStore is requested to replaceLogUrls
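Conceptually (a simplified sketch, not the actual doApplyPattern code), applying a pattern means substituting {{ATTRIBUTE}} tokens in the configured log URL pattern with the attributes reported for an executor; the substituteTokens helper and the attribute values below are made up for the example.

// Hypothetical helper: replace {{NAME}} tokens in a log URL pattern with attribute values
def substituteTokens(urlPattern: String, attributes: Map[String, String]): String =
  attributes.foldLeft(urlPattern) { case (url, (name, value)) =>
    url.replace(s"{{$name}}", value)
  }

val pattern = "http://logserver/{{APP_ID}}/{{EXECUTOR_ID}}/stdout"
val attributes = Map("APP_ID" -> "app-20240217-0001", "EXECUTOR_ID" -> "7")
substituteTokens(pattern, attributes)
// http://logserver/app-20240217-0001/7/stdout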
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"executor/ExecutorLogUrlHandler/#doapplypattern","title":"doApplyPattern
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doApplyPattern(\n  logUrls: Map[String, String],\n  attributes: Map[String, String],\n  urlPattern: String): Map[String, String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doApplyPattern...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"executor/ExecutorMetricType/","title":"ExecutorMetricType","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExecutorMetricType is an abstraction of executor metric types.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"executor/ExecutorMetricType/#contract","title":"Contract","text":""},{"location":"executor/ExecutorMetricType/#metric-values","title":"Metric Values
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getMetricValues(\n  memoryManager: MemoryManager): Array[Long]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorMetrics utility is used for the current metric values
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"executor/ExecutorMetricType/#metric-names","title":"Metric Names
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              names: Seq[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorMetricType utility is used for the metricToOffset and number of metrics
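ExecutorMetricType itself is a sealed, private[spark] trait, so the snippet below only mirrors the shape of the contract with a hypothetical ThreadCount metric type (and omits the MemoryManager parameter of the real getMetricValues):

import java.lang.management.ManagementFactory

// Illustration of the contract: metric names and an Array[Long] of current values line up positionally
object ThreadCountMetric {
  val names: Seq[String] = Seq("ThreadCount", "DaemonThreadCount")

  def getMetricValues(): Array[Long] = {
    val threads = ManagementFactory.getThreadMXBean
    Array(threads.getThreadCount.toLong, threads.getDaemonThreadCount.toLong)
  }
}

ThreadCountMetric.names.zip(ThreadCountMetric.getMetricValues()).toMap
// e.g. Map(ThreadCount -> 42, DaemonThreadCount -> 17)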
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"executor/ExecutorMetricType/#implementations","title":"Implementations","text":"Sealed Trait

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExecutorMetricType is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in the Scala Language Specification.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • GarbageCollectionMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ProcessTreeMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SingleValueExecutorMetricType
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • JVMHeapMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • JVMOffHeapMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MBeanExecutorMetricType
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DirectPoolMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MappedPoolMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MemoryManagerExecutorMetricType
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OffHeapExecutionMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OffHeapStorageMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OffHeapUnifiedMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OnHeapExecutionMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OnHeapStorageMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OnHeapUnifiedMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"executor/ExecutorMetricType/#executor-metric-getters-ordered-executormetrictypes","title":"Executor Metric Getters (Ordered ExecutorMetricTypes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExecutorMetricType defines an ordered collection of ExecutorMetricTypes:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. JVMHeapMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2. JVMOffHeapMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              3. OnHeapExecutionMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              4. OffHeapExecutionMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              5. OnHeapStorageMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              6. OffHeapStorageMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              7. OnHeapUnifiedMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              8. OffHeapUnifiedMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              9. DirectPoolMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              10. MappedPoolMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              11. ProcessTreeMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              12. GarbageCollectionMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              This ordering allows for passing metric values as arrays (to save space) with indices being a metric of a metric type.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              metricGetters is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorMetrics utility is used for the current metric values
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorMetricType utility is used to get the metricToOffset and the numMetrics
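A minimal sketch of that idea (the offsets and values below are made up and do not reflect Spark's actual metricToOffset layout):

// Illustration only: one flat Array[Long] carries every metric value,
// and a name-to-offset map makes each metric addressable by index
val metricToOffset: Map[String, Int] = Map(
  "JVMHeapMemory"    -> 0,
  "JVMOffHeapMemory" -> 1,
  "DirectPoolMemory" -> 2)

val metricValues: Array[Long] = Array(512L * 1024 * 1024, 64L * 1024 * 1024, 8L * 1024 * 1024)

def metric(name: String): Long = metricValues(metricToOffset(name))

metric("JVMHeapMemory")  // 536870912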
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"executor/ExecutorMetrics/","title":"ExecutorMetrics","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExecutorMetrics is a collection of executor metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExecutorMetrics takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExecutorMetrics is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is requested to reportHeartBeat
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested to post a SparkListenerTaskEnd event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorMetricsPoller is requested to getExecutorUpdates
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorMetricsJsonDeserializer is requested to deserialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • JsonProtocol is requested to executorMetricsFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetrics/#current-metric-values","title":"Current Metric Values
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getCurrentMetrics(\n  memoryManager: MemoryManager): Array[Long]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getCurrentMetrics gives metric values for every metric getter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Given that one metric getter (type) can report multiple metrics, the length of the result collection is the number of metrics (and at least the number of metric getters). The order matters and is exactly as metricGetters.
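A simplified sketch of that flattening, with plain functions standing in for the MemoryManager-based metric getters:

// Illustration only: concatenate the per-type value arrays in getter order into one Array[Long]
val metricGetters: Seq[() => Array[Long]] = Seq(
  () => Array(512L),       // a single-value metric type
  () => Array(7L, 3L, 1L)) // a metric type that reports several metrics

val currentMetrics: Array[Long] = metricGetters.flatMap(getter => getter()).toArray
// Array(512, 7, 3, 1) -- as many elements as metrics, not as metric getters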

getCurrentMetrics is used when:

• SparkContext is requested to reportHeartBeat
• ExecutorMetricsPoller is requested to poll
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":"","tags":["DeveloperApi"]},{"location":"executor/ExecutorMetricsPoller/","title":"ExecutorMetricsPoller","text":""},{"location":"executor/ExecutorMetricsPoller/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExecutorMetricsPoller takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • spark.executor.metrics.pollingInterval
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorMetricsSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsPoller is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"executor/ExecutorMetricsPoller/#executor-metrics-poller","title":"executor-metrics-poller

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsPoller creates a ScheduledExecutorService (Java) when created with the spark.executor.metrics.pollingInterval greater than 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The ScheduledExecutorService manages 1 daemon thread with executor-metrics-poller name prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The ScheduledExecutorService is requested to poll at every pollingInterval when ExecutorMetricsPoller is requested to start until stop.
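A minimal, Spark-free sketch of that scheduling pattern (the printed message and the 10-second interval are placeholders):

import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// A single daemon thread with a recognizable name, like the executor-metrics-poller thread
val poller: ScheduledExecutorService = Executors.newSingleThreadScheduledExecutor { (r: Runnable) =>
  val thread = new Thread(r, "executor-metrics-poller")
  thread.setDaemon(true)
  thread
}

// start: schedule the poll task at a fixed interval (placeholder: 10 seconds)
val pollingIntervalMs = 10 * 1000L
poller.scheduleAtFixedRate(() => println("polling executor metrics"), 0L, pollingIntervalMs, TimeUnit.MILLISECONDS)

// stop: poller.shutdown()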

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorMetricsPoller/#poll","title":"poll
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  poll(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  poll...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  poll is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is requested to reportHeartBeat
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorMetricsPoller is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorMetricsSource/","title":"ExecutorMetricsSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsSource is a metrics source.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"executor/ExecutorMetricsSource/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsSource takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsSource is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created (with spark.metrics.executorMetricsSource.enabled enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is created (with spark.metrics.executorMetricsSource.enabled enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"executor/ExecutorMetricsSource/#source-name","title":"Source Name
sourceName: String

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sourceName is ExecutorMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sourceName is part of the Source abstraction.
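
For reference, a Source boils down to a name plus a Dropwizard MetricRegistry. The sketch below only mirrors that shape (the class name is made up):

```scala
import com.codahale.metrics.MetricRegistry

// Sketch of the Source shape: a name and a Dropwizard MetricRegistry.
trait Source {
  def sourceName: String
  def metricRegistry: MetricRegistry
}

// Hypothetical implementation with the ExecutorMetrics source name.
class ExecutorMetricsLikeSource extends Source {
  override val sourceName: String = "ExecutorMetrics"
  override val metricRegistry: MetricRegistry = new MetricRegistry
}
```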

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorMetricsSource/#registering-with-metricssystem","title":"Registering with MetricsSystem
register(
  metricsSystem: MetricsSystem): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  register creates ExecutorMetricGauges for every executor metric.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  register requests the MetricRegistry to register every metric type.

In the end, register requests the given MetricsSystem to register this ExecutorMetricsSource.
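
The following is a minimal sketch of this flow, assuming a simplified two-metric layout; the object name, the metric names and the snapshot handling are illustrative stand-ins for Spark's internals:

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative sketch only: a simplified ExecutorMetricsSource-like object.
object ExecutorMetricsSourceSketch {
  val sourceName: String = "ExecutorMetrics"
  val metricRegistry: MetricRegistry = new MetricRegistry

  // Assumed two-metric layout; Spark tracks many more executor metrics.
  private val metricNames = Seq("JVMHeapMemory", "JVMOffHeapMemory")

  @volatile private var metricsSnapshot: Array[Long] =
    Array.fill(metricNames.length)(0L)

  def register(): Unit = {
    // One gauge per executor metric, each reading its slot of the snapshot.
    metricNames.zipWithIndex.foreach { case (name, idx) =>
      metricRegistry.register(name, new Gauge[Long] {
        override def getValue: Long = metricsSnapshot(idx)
      })
    }
    // In the end, the source itself would be registered with the MetricsSystem
    // (metricsSystem.registerSource(this) in Spark).
  }

  def updateMetricsSnapshot(metricsUpdates: Array[Long]): Unit =
    metricsSnapshot = metricsUpdates
}
```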

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  register is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is created (for non-local mode)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorMetricsSource/#metrics-snapshot","title":"Metrics Snapshot

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorMetricsSource defines metricsSnapshot internal registry of values of every metric.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The values are updated in updateMetricsSnapshot and read using ExecutorMetricGauges.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorMetricsSource/#updatemetricssnapshot","title":"updateMetricsSnapshot
updateMetricsSnapshot(
  metricsUpdates: Array[Long]): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  updateMetricsSnapshot updates the metricsSnapshot registry with the given metricsUpdates.
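
Continuing the ExecutorMetricsSourceSketch above (illustrative names and values only), an update is simply a reference swap that the registered gauges observe:

```scala
// Register the gauges, then push a new snapshot (values are made up).
ExecutorMetricsSourceSketch.register()
ExecutorMetricsSourceSketch.updateMetricsSnapshot(
  Array(512L * 1024 * 1024, 64L * 1024 * 1024))

// The registered gauges now report the new values.
val heapGauge = ExecutorMetricsSourceSketch.metricRegistry
  .getGauges.get("JVMHeapMemory")
println(heapGauge.getValue) // 536870912
```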

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  updateMetricsSnapshot is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested to reportHeartBeat
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorMetricsPoller is requested to poll
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"executor/ExecutorSource/","title":"ExecutorSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorSource is a Source of Executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"executor/ExecutorSource/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorSource takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ThreadPoolExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor ID (unused)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • File System Schemes (to report based on spark.executor.metrics.fileSystemSchemes)

ExecutorSource is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"executor/ExecutorSource/#name","title":"Name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExecutorSource is known under the name executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"executor/ExecutorSource/#metrics","title":"Metrics
metricRegistry: MetricRegistry

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    metricRegistry is part of the Source abstraction.
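
For illustration, a gauge such as threadpool.activeTasks (see the table below) can be backed directly by ThreadPoolExecutor.getActiveCount; the thread pool and registry in this sketch are assumptions:

```scala
import java.util.concurrent.{Executors, ThreadPoolExecutor}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative only: a fixed pool standing in for the executor's task thread pool.
val threadPool = Executors.newFixedThreadPool(4).asInstanceOf[ThreadPoolExecutor]
val metricRegistry = new MetricRegistry

metricRegistry.register("threadpool.activeTasks", new Gauge[Int] {
  // Approximate number of threads actively executing tasks.
  override def getValue: Int = threadPool.getActiveCount
})
```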

Name | Description
-----|------------
threadpool.activeTasks | Approximate number of threads that are actively executing tasks (based on ThreadPoolExecutor.getActiveCount)

(and other metrics)

ShuffleReadMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleReadMetrics is a collection of metrics (accumulators) on reading shuffle data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/ShuffleReadMetrics/#taskmetrics","title":"TaskMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleReadMetrics is available using TaskMetrics.shuffleReadMetrics.
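
As a small, illustrative usage (not from the Spark sources), ShuffleReadMetrics of finished tasks can be inspected through TaskMetrics in a SparkListener; the listener class below is hypothetical:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener that prints shuffle read metrics of finished tasks.
class ShuffleReadListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      val shuffleRead = metrics.shuffleReadMetrics
      println(s"task ${taskEnd.taskInfo.taskId}: " +
        s"recordsRead=${shuffleRead.recordsRead} " +
        s"remoteBytesRead=${shuffleRead.remoteBytesRead}")
    }
  }
}

// Register with an active SparkContext (assumed to exist as sc):
// sc.addSparkListener(new ShuffleReadListener)
```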

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleReadMetrics/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleReadMetrics is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/","title":"ShuffleWriteMetrics","text":"

ShuffleWriteMetrics is a ShuffleWriteMetricsReporter of metrics (accumulators) related to writing shuffle data (in shuffle map tasks), with a sketch after the list below:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Shuffle Bytes Written
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Shuffle Write Time
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Shuffle Records Written
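
A minimal sketch of the reporter idea follows; the trait and class are illustrative stand-ins, not Spark's actual ShuffleWriteMetricsReporter:

```scala
// Illustrative reporter trait with the three write-side counters.
trait WriteMetricsReporter {
  def incBytesWritten(v: Long): Unit
  def incRecordsWritten(v: Long): Unit
  def incWriteTime(v: Long): Unit
}

// Sketch of a ShuffleWriteMetrics-like holder backed by plain counters.
class ShuffleWriteMetricsSketch extends WriteMetricsReporter {
  private var bytesWritten = 0L
  private var recordsWritten = 0L
  private var writeTime = 0L // nanoseconds

  override def incBytesWritten(v: Long): Unit = bytesWritten += v
  override def incRecordsWritten(v: Long): Unit = recordsWritten += v
  override def incWriteTime(v: Long): Unit = writeTime += v

  def summary: String =
    s"bytesWritten=$bytesWritten recordsWritten=$recordsWritten writeTime=$writeTime"
}
```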
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleWriteMetrics takes no input arguments to be created.

ShuffleWriteMetrics is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskMetrics is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MapIterator (of BytesToBytesMap) is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalAppendOnlyMap is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to spillMemoryIteratorToDisk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeExternalSorter is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SpillableIterator (of UnsafeExternalSorter) is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#taskmetrics","title":"TaskMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleWriteMetrics is available using TaskMetrics.shuffleWriteMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/ShuffleWriteMetrics/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleWriteMetrics is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/","title":"TaskMetrics","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMetrics is a collection of metrics (accumulators) tracked during execution of a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMetrics takes no input arguments to be created.

TaskMetrics is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage is requested to makeNewStageAttempt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#metrics","title":"Metrics","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#shufflewritemetrics","title":"ShuffleWriteMetrics

ShuffleWriteMetrics with the following metrics:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffle.write.bytesWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffle.write.recordsWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffle.write.writeTime

ShuffleWriteMetrics is exposed through the Dropwizard metrics system using ExecutorSource (when TaskRunner is about to finish running) as:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffleBytesWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffleRecordsWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • shuffleWriteTime

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleWriteMetrics can be monitored using:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • StatsReportListener (when a stage completes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • shuffle bytes written
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JsonProtocol (when requested to taskMetricsToJson)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Shuffle Bytes Written
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Shuffle Write Time
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Shuffle Records Written

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleWriteMetrics is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleWriteProcessor is requested for a ShuffleWriteMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleWriter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • AppStatusListener is requested to handle a SparkListenerTaskEnd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LiveTask is requested to updateMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to writePartitionedFile (to create a DiskBlockObjectWriter), writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExchangeExec (Spark SQL) is requested for a ShuffleWriteProcessor (to create a ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#memory-bytes-spilled","title":"Memory Bytes Spilled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Number of in-memory bytes spilled by the tasks (of a stage)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    _memoryBytesSpilled is a LongAccumulator with internal.metrics.memoryBytesSpilled name.

The memoryBytesSpilled metric is exposed through the Dropwizard metrics system as memoryBytesSpilled (via ExecutorSource).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#memorybytesspilled","title":"memoryBytesSpilled","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    memoryBytesSpilled: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    memoryBytesSpilled is the sum of all memory bytes spilled across all tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    memoryBytesSpilled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SpillListener is requested to onStageCompleted
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskRunner is requested to run (and updates task metrics in the Dropwizard metrics system)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LiveTask is requested to updateMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JsonProtocol is requested to taskMetricsToJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#incmemorybytesspilled","title":"incMemoryBytesSpilled","text":"
incMemoryBytesSpilled(
  v: Long): Unit

incMemoryBytesSpilled adds the given v value to the _memoryBytesSpilled metric (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    incMemoryBytesSpilled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Aggregator is requested to updateMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BasePythonRunner.ReaderIterator is requested to handleTimingData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CoGroupedRDD is requested to compute a partition
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExternalSorter is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JsonProtocol is requested to taskMetricsFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to insertAllAndUpdateMetrics, writePartitionedFile, writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeExternalSorter is requested to createWithExistingInMemorySorter, spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeExternalSorter.SpillableIterator is requested to spill
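The following is a minimal, self-contained sketch of the add-to-sum behavior described above, using a plain LongAccumulator (the same type as _memoryBytesSpilled). The real incMemoryBytesSpilled is internal to Spark and not callable from user code; the object and method names below are illustrative only.

import org.apache.spark.util.LongAccumulator

// Illustrative stand-in for TaskMetrics._memoryBytesSpilled (a LongAccumulator).
object IncMemoryBytesSpilledSketch extends App {
  private val memoryBytesSpilled = new LongAccumulator

  // Mimics incMemoryBytesSpilled(v: Long): adds v to the accumulator's running sum.
  def incMemoryBytesSpilled(v: Long): Unit = memoryBytesSpilled.add(v)

  incMemoryBytesSpilled(4 * 1024)   // e.g. a 4 KiB spill
  incMemoryBytesSpilled(16 * 1024)  // e.g. a 16 KiB spill

  assert(memoryBytesSpilled.sum == 20 * 1024)
}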
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#taskcontext","title":"TaskContext

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMetrics is available using TaskContext.taskMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskContext.get.taskMetrics\n
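For example (a sketch meant for spark-shell, where sc is already defined; the partitioning and the metrics printed are arbitrary), a task can inspect its own TaskMetrics while it runs. In local mode the output appears on the driver console; on a cluster it goes to the executor logs.

import org.apache.spark.TaskContext

// Read this task's TaskMetrics from inside the task itself.
sc.parallelize(1 to 100, numSlices = 2).foreachPartition { _ =>
  val metrics = TaskContext.get.taskMetrics
  println(
    s"partition=${TaskContext.getPartitionId()} " +
    s"memoryBytesSpilled=${metrics.memoryBytesSpilled} " +
    s"diskBytesSpilled=${metrics.diskBytesSpilled}")
}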
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMetrics is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#task","title":"Task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMetrics is part of Task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    task.metrics\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#sparklistener","title":"SparkListener

TaskMetrics is available to a SparkListener by intercepting SparkListenerTaskEnd events.
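As a sketch (the listener name and what it prints are made up; registration is done programmatically here), a SparkListener can read the TaskMetrics carried by every SparkListenerTaskEnd event:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print the spill metrics of every completed task.
class SpillReportingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {  // taskMetrics can be null for failed tasks
      println(
        s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"memoryBytesSpilled=${metrics.memoryBytesSpilled}")
    }
  }
}

sc.addSparkListener(new SpillReportingListener)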

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#statsreportlistener","title":"StatsReportListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StatsReportListener can be used for summary statistics at runtime (after a stage completes).
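One way to try it out (a sketch for spark-shell, where sc is already defined) is to register it programmatically; it can also be registered declaratively with the spark.extraListeners configuration property.

// Register StatsReportListener to get per-stage summary statistics in the logs.
sc.addSparkListener(new org.apache.spark.scheduler.StatsReportListener)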

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskMetrics/#spark-history-server","title":"Spark History Server

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Spark History Server uses EventLoggingListener to intercept post-execution statistics (incl. TaskMetrics).
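For example, a standalone application sketch that turns on event logging (the master, application name, and log directory are illustrative; the directory is assumed to exist):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("event-log-demo")
  .set("spark.eventLog.enabled", "true")       // write event logs for the History Server
  .set("spark.eventLog.dir", "/tmp/spark-events")  // must exist before the app starts
val sc = new SparkContext(conf)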

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"executor/TaskRunner/","title":"TaskRunner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskRunner is a thread of execution to run a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Internal Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskRunner is an internal class of Executor with full access to internal registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskRunner is a java.lang.Runnable so once a TaskRunner has completed execution it must not be restarted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"executor/TaskRunner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskRunner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExecutorBackend (that manages the parent Executor)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskDescription
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • PluginContainer
TaskRunner is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Executor is requested to launch a task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"executor/TaskRunner/#plugincontainer","title":"PluginContainer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskRunner may be given a PluginContainer when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The PluginContainer is used when TaskRunner is requested to run (for the Task to run).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#demo","title":"Demo
./bin/spark-shell --conf spark.driver.maxResultSize=1m

scala> println(sc.version)
3.0.1

val maxResultSize = sc.getConf.get("spark.driver.maxResultSize")
assert(maxResultSize == "1m")

val rddOver1m = sc.range(0, 1024 * 1024 + 10, 1)

scala> rddOver1m.collect
ERROR TaskSetManager: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
ERROR TaskSetManager: Total size of serialized results of 3 tasks (1546.2 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
ERROR TaskSetManager: Total size of serialized results of 4 tasks (2.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
ERROR TaskSetManager: Total size of serialized results of 5 tasks (2.5 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
  ...
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#thread-name","title":"Thread Name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskRunner uses the following thread name (with the taskId of the TaskDescription):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Executor task launch worker for task [taskId]\n
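A quick diagnostic sketch (run in spark-shell in local mode, where driver and executor share one JVM) is to filter the JVM's live threads by that name prefix while tasks are running:

import scala.collection.JavaConverters._

// List live JVM threads created by TaskRunner (identified by the name prefix above).
Thread.getAllStackTraces.keySet.asScala
  .map(_.getName)
  .filter(_.startsWith("Executor task launch worker"))
  .foreach(println)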
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#running-task","title":"Running Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run is part of the java.lang.Runnable abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#initialization","title":"Initialization

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run initializes the threadId internal registry as the current thread identifier (using Thread.getId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run sets the name of the current thread of execution as the threadName.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a TaskMemoryManager (for the current MemoryManager and taskId). run uses SparkEnv to access the current MemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run starts tracking the time to deserialize a task and sets the current thread's context classloader.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a closure Serializer. run uses SparkEnv to access the closure Serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run prints out the following INFO message to the logs (with the taskName and taskId):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Running [taskName] (TID [taskId])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run notifies the ExecutorBackend that the status of the task has changed to RUNNING (for the taskId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run computes the total amount of time this JVM process has spent in garbage collection.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run uses the addedFiles and addedJars (of the given TaskDescription) to update dependencies.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run takes the serializedTask of the given TaskDescription and requests the closure Serializer to deserialize the task. run sets the task internal reference to hold the deserialized task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      For non-local environments, run prints out the following DEBUG message to the logs before requesting the MapOutputTrackerWorker to update the epoch (using the epoch of the Task to be executed). run uses SparkEnv to access the MapOutputTrackerWorker.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task [taskId]'s epoch is [epoch]\n

run requests the metricsPoller (ExecutorMetricsPoller) to onTaskStart (with the taskId and the stage and stage attempt IDs of the Task).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run records the current time as the task's start time (taskStartTimeNs).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run requests the Task to run (with taskAttemptId as taskId, attemptNumber from TaskDescription, and metricsSystem as the current MetricsSystem).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run uses SparkEnv to access the MetricsSystem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The task runs inside a \"monitored\" block (try-finally block) to detect any memory and lock leaks after the task's run finishes regardless of the final outcome - the computed value or an exception thrown.
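As a generic, self-contained sketch of this try-finally "monitored block" idea (illustrative only; the names and the leak check are made up and this is not Spark's code):

// A computation that returns a value or throws always reaches the finally block,
// which is where leaked resources (memory, locks) can be detected and released.
object MonitoredBlockSketch extends App {
  var lockHeld = false

  def runTask(): Int = {
    lockHeld = true  // pretend the task acquired a lock and "forgot" to release it
    42
  }

  val value =
    try runTask()
    finally {
      if (lockHeld) {
        println("leak detected: releasing lock after the task finished")
        lockHeld = false
      }
    }

  println(s"task value: $value")
}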

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a Serializer and requests it to serialize the task result (valueBytes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run uses SparkEnv to access the Serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run updates the metrics of the Task executed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run updates the metric counters in the ExecutorSource.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run requests the Task executed for accumulator updates and the ExecutorMetricsPoller for metric peaks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#serialized-task-result","title":"Serialized Task Result

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a DirectTaskResult (with the serialized task result, the accumulator updates and the metric peaks) and requests the closure Serializer to serialize it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The serialized DirectTaskResult is a java.nio.ByteBuffer.

run selects between the DirectTaskResult and an IndirectTaskResult based on the size of the serialized task result (the limit of the serializedDirectResult byte buffer), as sketched after the list below:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1. With the size above spark.driver.maxResultSize, run prints out the following WARN message to the logs and serializes an IndirectTaskResult with a TaskResultBlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Finished [taskName] (TID [taskId]). Result is larger than maxResultSize ([resultSize] > [maxResultSize]), dropping it.\n
2. With the size above maxDirectResultSize, run creates a TaskResultBlockId and requests the BlockManager to store the task result locally (with MEMORY_AND_DISK_SER). run prints out the following INFO message to the logs and serializes an IndirectTaskResult with the TaskResultBlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Finished [taskName] (TID [taskId]). [resultSize] bytes result sent via BlockManager)\n
3. Otherwise, run prints out the following INFO message to the logs and uses the DirectTaskResult created earlier.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to driver\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

serializedResult is either an IndirectTaskResult (possibly with the block stored in the BlockManager) or a DirectTaskResult.
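The three-way decision above can be captured in a small, self-contained sketch (plain Scala, not the actual Executor code; it assumes that a maxResultSize of 0 means unlimited, as with spark.driver.maxResultSize):

```scala
// Sketch of the routing decision only; serialization, BlockManager storage and logging
// are deliberately left out.
sealed trait ResultRoute
case object DroppedOverMaxResultSize extends ResultRoute // IndirectTaskResult with just a TaskResultBlockId
case object SentViaBlockManager extends ResultRoute      // IndirectTaskResult + block stored with MEMORY_AND_DISK_SER
case object SentDirectlyToDriver extends ResultRoute     // DirectTaskResult sent back as-is

def chooseRoute(resultSize: Long, maxResultSize: Long, maxDirectResultSize: Long): ResultRoute =
  if (maxResultSize > 0 && resultSize > maxResultSize) DroppedOverMaxResultSize
  else if (resultSize > maxDirectResultSize) SentViaBlockManager
  else SentDirectlyToDriver

// For example: a 2 MB result with a 1 GB maxResultSize and a 1 MB maxDirectResultSize
// goes via the BlockManager.
chooseRoute(resultSize = 2L << 20, maxResultSize = 1L << 30, maxDirectResultSize = 1L << 20)
```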

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#incrementing-succeededtasks-counter","title":"Incrementing succeededTasks Counter

run requests the ExecutorSource to increment the succeededTasks counter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#marking-task-finished","title":"Marking Task Finished

run calls setTaskFinishedAndClearInterruptStatus to mark the task as finished and clear the interrupt status of the task's thread.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#notifying-executorbackend-that-task-finished","title":"Notifying ExecutorBackend that Task Finished

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run notifies the ExecutorBackend that the status of the taskId has changed to FINISHED.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ExecutorBackend is given when the TaskRunner is created.
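The shape of that notification can be sketched with toy types (not the actual Spark classes; the real statusUpdate carries a taskId, a TaskState, and a ByteBuffer with the serialized result or reason):

```scala
import java.nio.ByteBuffer

// Toy model of the interaction between a TaskRunner and its ExecutorBackend.
object TaskState extends Enumeration {
  val RUNNING, FINISHED, FAILED, KILLED = Value
}

trait ExecutorBackend {
  def statusUpdate(taskId: Long, state: TaskState.Value, data: ByteBuffer): Unit
}

class LoggingBackend extends ExecutorBackend {
  override def statusUpdate(taskId: Long, state: TaskState.Value, data: ByteBuffer): Unit =
    println(s"task $taskId -> $state (${data.remaining()} bytes)")
}

// What run does at the very end of a successful task (sketch): hand the backend it was
// created with the taskId, FINISHED, and the serialized result.
val backend: ExecutorBackend = new LoggingBackend
backend.statusUpdate(taskId = 0L, state = TaskState.FINISHED, data = ByteBuffer.allocate(0))
```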

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#wrapping-up","title":"Wrapping Up

In the end, regardless of the task's execution status (successful or failed), run removes the taskId from the runningTasks registry.

If an onTaskStart notification was sent out, run requests the ExecutorMetricsPoller to onTaskCompletion.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#exceptions","title":"Exceptions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run handles certain exceptions.

| Exception Type | TaskState | Serialized ByteBuffer |
|----------------|-----------|-----------------------|
| FetchFailedException | FAILED | TaskFailedReason |
| TaskKilledException | KILLED | TaskKilled |
| InterruptedException | KILLED | TaskKilled |
| CommitDeniedException | FAILED | TaskFailedReason |
| Throwable | FAILED | ExceptionFailure |
","text":""},{"location":"executor/TaskRunner/#fetchfailedexception","title":"FetchFailedException

When a FetchFailedException is reported while running a task, run calls setTaskFinishedAndClearInterruptStatus.

run requests the FetchFailedException for the TaskFailedReason (toTaskFailedReason), serializes it, and notifies the ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and the serialized reason).

NOTE: ExecutorBackend was specified when the TaskRunner was created.

NOTE: run uses a closure Serializer to serialize the failure reason. The Serializer was created before run ran the task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#taskkilledexception","title":"TaskKilledException

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When TaskKilledException is reported while running a task, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Executor killed [taskName] (TID [taskId]), reason: [reason]\n

run then calls setTaskFinishedAndClearInterruptStatus and notifies the ExecutorBackend that the task has been killed (with the taskId, TaskState.KILLED, and a serialized TaskKilled object).","text":""},{"location":"executor/TaskRunner/#interruptedexception-with-task-killed","title":"InterruptedException (with Task Killed)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When InterruptedException is reported while running a task, and the task has been killed, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Executor interrupted and killed [taskName] (TID [taskId]), reason: [killReason]\n

run then calls setTaskFinishedAndClearInterruptStatus and notifies the ExecutorBackend that the task has been killed (with the taskId, TaskState.KILLED, and a serialized TaskKilled object).

NOTE: The difference between this InterruptedException case and the TaskKilledException case is the INFO message in the logs.","text":""},{"location":"executor/TaskRunner/#commitdeniedexception","title":"CommitDeniedException

When CommitDeniedException is reported while running a task, run calls setTaskFinishedAndClearInterruptStatus and notifies the ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and a serialized TaskFailedReason).

NOTE: The difference between this CommitDeniedException case and the FetchFailedException case is just the reason being sent to the ExecutorBackend.","text":""},{"location":"executor/TaskRunner/#throwable","title":"Throwable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When run catches a Throwable, you should see the following ERROR message in the logs (followed by the exception).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Exception in [taskName] (TID [taskId])\n

run then records the following task metrics (only when the Task is available):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskMetrics.md#setExecutorRunTime[executorRunTime]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskMetrics.md#setJvmGCTime[jvmGCTime]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run then scheduler:Task.md#collectAccumulatorUpdates[collects the latest values of internal and external accumulators] (with taskFailed flag enabled to inform that the collection is for a failed task).

Otherwise, when the Task is not available, the accumulator collection is empty.

run converts the task accumulators to a collection of AccumulableInfo, creates an ExceptionFailure (with the accumulators), and serializes it.

NOTE: run uses a closure Serializer to serialize the ExceptionFailure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CAUTION: FIXME Why does run create new ExceptionFailure(t, accUpdates).withAccums(accums), i.e. accumulators occur twice in the object.

run calls setTaskFinishedAndClearInterruptStatus and notifies the ExecutorBackend that the task has failed (with the taskId, TaskState.FAILED, and the serialized ExceptionFailure).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run may also trigger SparkUncaughtExceptionHandler.uncaughtException(t) if this is a fatal error.

NOTE: The difference between this Throwable case and the other FAILED cases (i.e. FetchFailedException and CommitDeniedException) is the serialized ExceptionFailure vs. a serialized reason being sent to the ExecutorBackend, respectively.","text":""},{"location":"executor/TaskRunner/#collectaccumulatorsandresetstatusonfailure","title":"collectAccumulatorsAndResetStatusOnFailure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorsAndResetStatusOnFailure(\n  taskStartTimeNs: Long)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorsAndResetStatusOnFailure...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#killing-task","title":"Killing Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      kill(\n  interruptThread: Boolean,\n  reason: String): Unit\n

kill marks the TaskRunner as killed (by setting the reasonIfKilled) and kills the task (if available and not finished already).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: kill passes the input interruptThread on to the task itself while killing it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When executed, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Executor is trying to kill [taskName] (TID [taskId]), reason: [reason]\n
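The cooperative nature of the kill (see the note that follows) can be illustrated with a self-contained sketch; the class and field names below are illustrative only, not the actual TaskRunner internals:

```scala
import java.util.concurrent.atomic.AtomicReference

// Toy sketch: kill records a reason (and optionally interrupts the task thread);
// the running task checks the reason periodically and stops once it notices it.
class ToyTaskRunner extends Runnable {
  private val reasonIfKilled = new AtomicReference[String](null)
  @volatile private var taskThread: Thread = _

  def kill(interruptThread: Boolean, reason: String): Unit = {
    println(s"Executor is trying to kill task, reason: $reason")
    reasonIfKilled.set(reason)
    if (interruptThread && taskThread != null) taskThread.interrupt()
  }

  override def run(): Unit = {
    taskThread = Thread.currentThread()
    var iterations = 0L
    while (reasonIfKilled.get == null && iterations < 100000000L) {
      iterations += 1 // simulated work; a real task checks the kill reason at safe points
    }
    Option(reasonIfKilled.get).foreach(reason => println(s"Task killed: $reason"))
  }
}
```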

NOTE: The kill reason (reasonIfKilled) is checked periodically while the task runs so that the task stops executing. Once killed, the task will eventually stop.","text":""},{"location":"executor/TaskRunner/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.executor.Executor logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.executor.Executor=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#internal-properties","title":"Internal Properties","text":""},{"location":"executor/TaskRunner/#finished-flag","title":"finished Flag

finished flag says whether the task has finished (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: false

Enabled (true) after TaskRunner has been requested to setTaskFinishedAndClearInterruptStatus

Used when TaskRunner is requested to kill the task","text":""},{"location":"executor/TaskRunner/#reasonifkilled","title":"reasonIfKilled

Reason to kill the task (and avoid running it)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: (empty) (None)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"executor/TaskRunner/#startgctime-timestamp","title":"startGCTime Timestamp

Timestamp (which is really the total amount of time this Executor JVM process has already spent in garbage collection, per Executor.md#computeTotalGcTime) that is used to mark the GC \"zero\" time (when the task starts running) and then compute the JVM GC time metric (see the sketch after the list below) when:

• TaskRunner is requested to run and collectAccumulatorsAndResetStatusOnFailure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Executor is requested to Executor.md#reportHeartBeat[reportHeartBeat]
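A self-contained sketch of the arithmetic (computeTotalGcTime below is a stand-in for Executor.md#computeTotalGcTime and simply sums the collection times of the JVM's garbage collectors):

```scala
import java.lang.management.ManagementFactory

// Stand-in for Executor.computeTotalGcTime: total time (ms) this JVM has spent in GC so far.
def computeTotalGcTime(): Long = {
  var total = 0L
  ManagementFactory.getGarbageCollectorMXBeans.forEach(bean => total += bean.getCollectionTime)
  total
}

val startGCTime = computeTotalGcTime() // the GC "zero" time, captured when the task starts running
// ... the task runs ...
val jvmGCTime = computeTotalGcTime() - startGCTime // reported as the task's JVM GC time metric
```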

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ","text":""},{"location":"executor/TaskRunner/#task","title":"Task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Deserialized scheduler:Task.md[task] to execute

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskRunner is requested to <>, <>, <>, <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor is requested to Executor.md#reportHeartBeat[reportHeartBeat]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ","text":""},{"location":"executor/TaskRunner/#task-name","title":"Task Name

The name of the task (of the TaskDescription) that is used exclusively for logging purposes when TaskRunner is requested to run and kill the task","text":""},{"location":"executor/TaskRunner/#thread-id","title":"Thread Id

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Current thread ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: -1

Set immediately when TaskRunner is requested to run and used exclusively when TaskReaper is requested for the thread info of the current thread (aka thread dump)","text":""},{"location":"exercises/spark-examples-wordcount-spark-shell/","title":"WordCount using Spark shell","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == WordCount using Spark shell

It seems that any introductory big data example has to demonstrate how to count words in a distributed fashion.

In the following example you're going to count the words in the README.md file that sits in your Spark distribution and save the result under the README.count directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          You're going to use spark-shell.md[the Spark shell] for the example. Execute spark-shell.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-examples-wordcount-spark-shell/#sourcescala","title":"[source,scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          val lines = sc.textFile(\"README.md\") // <1>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          val words = lines.flatMap(_.split(\"\\s+\")) // <2>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          val wc = words.map(w => (w, 1)).reduceByKey(_ + _) // <3>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-examples-wordcount-spark-shell/#wcsaveastextfilereadmecount-4","title":"wc.saveAsTextFile(\"README.count\") // <4>","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          <1> Read the text file - refer to spark-io.md[Using Input and Output (I/O)]. <2> Split each line into words and flatten the result. <3> Map each word into a pair and count them by word (key). <4> Save the result into text files - one per partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          After you have executed the example, see the contents of the README.count directory:

$ ls -lt README.count
total 16
-rw-r--r--  1 jacek  staff     0  9 paź 13:36 _SUCCESS
-rw-r--r--  1 jacek  staff  1963  9 paź 13:36 part-00000
-rw-r--r--  1 jacek  staff  1663  9 paź 13:36 part-00001

The part-0000x files contain pairs of a word and its count.

$ cat README.count/part-00000
(package,1)
(this,1)
(Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1)
(Because,1)
(Python,2)
(cluster.,1)
(its,1)
([run,1)
...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Further (self-)development

Please read the questions and give your own answers before looking at the link given below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1. Why are there two files under the directory?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2. How could you have only one?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          3. How to filter out words by name?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          4. How to count words?

Please refer to the chapter spark-rdd-partitions.md[Partitions] to find some of the answers; a sketch of one possible approach follows below.
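A minimal sketch of possible answers, assuming README.md sits in the current directory; the output path README.count.single and the excluded word list are made up for illustration:

[source, scala]
----
// Q1: sc.textFile splits README.md into partitions (two by default) and
//     saveAsTextFile writes one part-NNNNN file per partition -- hence two files.
val words = sc.textFile("README.md").flatMap(_.split("\\s+")).filter(_.nonEmpty)

// Q2: coalesce to a single partition before saving to get exactly one part file.
words.map((_, 1)).reduceByKey(_ + _).coalesce(1).saveAsTextFile("README.count.single")

// Q3 and Q4: drop selected words by name, then count the remaining ones.
val excluded = Set("the", "a", "to")  // hypothetical list of words to filter out
val counts = words.filter(word => !excluded(word)).map((_, 1)).reduceByKey(_ + _)
counts.take(5).foreach(println)
----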

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-custom-scheduler-listener/","title":"Developing Custom SparkListener to monitor DAGScheduler in Scala","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The example shows how to develop a custom Spark Listener. You should read SparkListener.md[] first to understand the motivation for the example.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Requirements

1. https://www.jetbrains.com/idea/[IntelliJ IDEA] (or http://www.scala-sbt.org/[sbt] alone if you're adventurous).
2. Access to the Internet to download Apache Spark's dependencies.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Setting up Scala project using IntelliJ IDEA

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Create a new project custom-spark-listener.

Add the following line to build.sbt (the main configuration file of the sbt project) to declare the dependency on Apache Spark (Spark Core).

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          build.sbt should look as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          name := \"custom-spark-listener\" organization := \"pl.jaceklaskowski.spark\" version := \"1.0\"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          scalaVersion := \"2.11.8\"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#librarydependencies-orgapachespark-spark-core-201","title":"libraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"2.0.1\"","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Custom Listener - pl.jaceklaskowski.spark.CustomSparkListener

Create a Scala class -- CustomSparkListener -- for your custom SparkListener. It should be under the src/main/scala directory (create the directory if it does not exist).

The aim of the class is to intercept scheduler events about jobs being started and stages being completed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-custom-scheduler-listener/#sourcescala","title":"[source,scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          package pl.jaceklaskowski.spark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          import org.apache.spark.scheduler.{SparkListenerStageCompleted, SparkListener, SparkListenerJobStart}

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          class CustomSparkListener extends SparkListener { override def onJobStart(jobStart: SparkListenerJobStart) { println(s\"Job started with ${jobStart.stageInfos.size} stages: $jobStart\") }

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = { println(s\"Stage ${stageCompleted.stageInfo.stageId} completed with ${stageCompleted.stageInfo.numTasks} tasks.\") } }

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Creating deployable package

Package the custom Spark listener. Execute the sbt package command in the custom-spark-listener project's top-level directory.

$ sbt package
[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins
[info] Loading project definition from /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project
[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project/}custom-spark-listener-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to custom-spark-listener (in build file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/)
[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/}custom-spark-listener...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/classes...
[info] Packaging /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/custom-spark-listener_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed Oct 27, 2016 11:23:50 AM

You should find the resulting jar file with the custom scheduler listener under the target/scala-2.11 directory, i.e. target/scala-2.11/custom-spark-listener_2.11-1.0.jar.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Activating Custom Listener in Spark shell

Start ../spark-shell.md[spark-shell] with the additional configuration for the extra custom listener and the jar that contains the class.

$ spark-shell \
  --conf spark.logConf=true \
  --conf spark.extraListeners=pl.jaceklaskowski.spark.CustomSparkListener \
  --driver-class-path target/scala-2.11/custom-spark-listener_2.11-1.0.jar

Create a ../spark-sql-Dataset.md#implicits[Dataset] and execute an action like count to start a job as follows:

scala> spark.read.text("README.md").count
[CustomSparkListener] Job started with 2 stages: SparkListenerJobStart(1,1473946006715,WrappedArray(org.apache.spark.scheduler.StageInfo@71515592, org.apache.spark.scheduler.StageInfo@6852819d),{spark.rdd.scope.noOverride=true, spark.rdd.scope={"id":"14","name":"collect"}, spark.sql.execution.id=2})
[CustomSparkListener] Stage 1 completed with 1 tasks.
[CustomSparkListener] Stage 2 completed with 1 tasks.
res0: Long = 7

The lines with [CustomSparkListener] come from your custom Spark listener. Congratulations! The exercise is over.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === BONUS Activating Custom Listener in Spark Application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TIP: Read SparkContext.md#addSparkListener[Registering SparkListener].
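The TIP above points at the programmatic route; a minimal sketch is shown below. The SparkApp object name and the local[*] master are assumptions for illustration, and CustomSparkListener is the class developed earlier in this exercise.

[source, scala]
----
import org.apache.spark.{SparkConf, SparkContext}
import pl.jaceklaskowski.spark.CustomSparkListener

object SparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("custom-listener-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // The code-level equivalent of --conf spark.extraListeners=...
    sc.addSparkListener(new CustomSparkListener)

    // Trigger a job so the listener has something to report
    sc.parallelize(1 to 100).count()

    sc.stop()
  }
}
----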

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Questions

1. What are the pros and cons of registering the listener on the command line (spark.extraListeners) versus programmatically inside a Spark application?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/","title":"Working with Datasets from JDBC Data Sources (and PostgreSQL)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == Working with Datasets from JDBC Data Sources (and PostgreSQL)

Start spark-shell with the JDBC driver for the database you want to use. In our case, it is the PostgreSQL JDBC Driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: Download the jar for PostgreSQL JDBC Driver 42.1.1 directly from the http://central.maven.org/maven2/org/postgresql/postgresql/42.1.1/postgresql-42.1.1.jar[Maven repository].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Execute the command to have the jar downloaded into ~/.ivy2/jars directory by spark-shell itself:

./bin/spark-shell --packages org.postgresql:postgresql:42.1.1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The entire path to the driver file is then like /Users/jacek/.ivy2/jars/org.postgresql_postgresql-42.1.1.jar.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          You should see the following while spark-shell downloads the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#ivy-default-cache-set-to-usersjacekivy2cache-the-jars-for-the-packages-stored-in-usersjacekivy2jars-loading-settings-url-jarfileusersjacekdevosssparkassemblytargetscala-211jarsivy-240jarorgapacheivycoresettingsivysettingsxml-orgpostgresqlpostgresql-added-as-a-dependency-resolving-dependencies-orgapachesparkspark-submit-parent10-confs-default-found-orgpostgresqlpostgresql4211-in-central-downloading-httpsrepo1mavenorgmaven2orgpostgresqlpostgresql4211postgresql-4211jar-successful-orgpostgresqlpostgresql4211postgresqljarbundle-205ms-resolution-report-resolve-1887ms-artifacts-dl-207ms-modules-in-use-orgpostgresqlpostgresql4211-from-central-in-default-modules-artifacts-conf-number-searchdwnldedevicted-numberdwnlded-default-1-1-1-0-1-1-retrieving-orgapachesparkspark-submit-parent-confs-default-1-artifacts-copied-0-already-retrieved-695kb8ms","title":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Ivy Default Cache set to: /Users/jacek/.ivy2/cache\nThe jars for the packages stored in: /Users/jacek/.ivy2/jars\n:: loading settings :: url = jar:file:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\norg.postgresql#postgresql added as a dependency\n:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0\n    confs: [default]\n    found org.postgresql#postgresql;42.1.1 in central\ndownloading https://repo1.maven.org/maven2/org/postgresql/postgresql/42.1.1/postgresql-42.1.1.jar ...\n    [SUCCESSFUL ] org.postgresql#postgresql;42.1.1!postgresql.jar(bundle) (205ms)\n:: resolution report :: resolve 1887ms :: artifacts dl 207ms\n    :: modules in use:\n    org.postgresql#postgresql;42.1.1 from central in [default]\n    ---------------------------------------------------------------------\n    |                  |            modules            ||   artifacts   |\n    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|\n    ---------------------------------------------------------------------\n    |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |\n    ---------------------------------------------------------------------\n:: retrieving :: org.apache.spark#spark-submit-parent\n    confs: [default]\n    1 artifacts copied, 0 already retrieved (695kB/8ms)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":"

Start ./bin/spark-shell with the spark-submit/index.md#driver-class-path[--driver-class-path] command-line option pointing at the driver jar.

SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell --driver-class-path /Users/jacek/.ivy2/jars/org.postgresql_postgresql-42.1.1.jar

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It will give you the proper setup for accessing PostgreSQL using the JDBC driver.

Execute the following to access the projects table in the sparkdb database.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // that gives an one-partition Dataset val opts = Map( \"url\" -> \"jdbc:postgresql:sparkdb\", \"dbtable\" -> \"projects\") val df = spark. read. format(\"jdbc\"). options(opts). load

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: Use user and password options to specify the credentials if needed.
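For example, a sketch with placeholder credentials (the user and password values below are made up) could simply extend the options map:

[source, scala]
----
// hypothetical credentials -- replace with your own
val optsWithAuth = opts ++ Map(
  "user"     -> "jacek",
  "password" -> "secret")
val dfAuth = spark.read.format("jdbc").options(optsWithAuth).load
----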

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // Note the number of partition (aka numPartitions) scala> df.explain == Physical Plan == *Scan JDBCRelation(projects) [numPartitions=1] [id#0,name#1,website#2] ReadSchema: struct

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          scala> df.show(truncate = false) +---+------------+-----------------------+ |id |name |website | +---+------------+-----------------------+ |1 |Apache Spark|http://spark.apache.org| |2 |Apache Hive |http://hive.apache.org | |3 |Apache Kafka|http://kafka.apache.org| |4 |Apache Flink|http://flink.apache.org| +---+------------+-----------------------+

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // use jdbc method with predicates to define partitions import java.util.Properties val df4parts = spark. read. jdbc( url = \"jdbc:postgresql:sparkdb\", table = \"projects\", predicates = Array(\"id=1\", \"id=2\", \"id=3\", \"id=4\"), connectionProperties = new Properties())

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // Note the number of partitions (aka numPartitions) scala> df4parts.explain == Physical Plan == *Scan JDBCRelation(projects) [numPartitions=4] [id#16,name#17,website#18] ReadSchema: struct

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          scala> df4parts.show(truncate = false) +---+------------+-----------------------+ |id |name |website | +---+------------+-----------------------+ |1 |Apache Spark|http://spark.apache.org| |2 |Apache Hive |http://hive.apache.org | |3 |Apache Kafka|http://kafka.apache.org| |4 |Apache Flink|http://flink.apache.org| +---+------------+-----------------------+

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === Troubleshooting

If things can go wrong, sooner or later they will. Here is a list of possible issues and their solutions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ==== java.sql.SQLException: No suitable driver

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Ensure that the JDBC driver sits on the CLASSPATH. Use spark-submit/index.md#driver-class-path[--driver-class-path] as described above (--packages or --jars do not work).

scala> val df = spark.
     |   read.
     |   format("jdbc").
     |   options(opts).
     |   load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:301)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
  ... 52 elided
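As a quick sanity check (not part of the original walkthrough), you can confirm from the Scala REPL that the driver class is visible on the classpath:

[source, scala]
----
// Succeeds only when the PostgreSQL driver jar is on the driver's classpath;
// otherwise it throws java.lang.ClassNotFoundException.
scala> Class.forName("org.postgresql.Driver")
res1: Class[_] = class org.postgresql.Driver
----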

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === PostgreSQL Setup

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: I'm on Mac OS X so YMMV (aka Your Mileage May Vary).

Use the following sections to set up a properly configured PostgreSQL database.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== [[installation]] Installation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Install PostgreSQL as described in...TK

CAUTION: This page serves as a cheatsheet for the author so he does not have to search the Internet to find the installation steps.

$ initdb /usr/local/var/postgres -E utf8
The files belonging to this database system will be owned by user "jacek".
This user must also own the server process.

The database cluster will be initialized with locale "pl_pl.utf-8".
initdb: could not find suitable text search configuration for locale "pl_pl.utf-8"
The default text search configuration will be set to "simple".

Data page checksums are disabled.

creating directory /usr/local/var/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /usr/local/var/postgres/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /usr/local/var/postgres -l logfile start

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== [[starting-database-server]] Starting Database Server

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: Consult http://www.postgresql.org/docs/current/static/server-start.html[17.3. Starting the Database Server] in the official documentation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#tip_1","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable all logs in PostgreSQL to see query statements.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log_statement = 'all'\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#add-log_statement-all-to-usrlocalvarpostgrespostgresqlconf-on-mac-os-x-with-postgresql-installed-using-brew","title":"Add log_statement = 'all' to /usr/local/var/postgres/postgresql.conf on Mac OS X with PostgreSQL installed using brew.","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start the database server using pg_ctl.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ pg_ctl -D /usr/local/var/postgres -l logfile start\nserver starting\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Alternatively, you can run the database server using postgres.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ postgres -D /usr/local/var/postgres\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== [[creating-database]] Create Database

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ createdb sparkdb\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TIP: Consult http://www.postgresql.org/docs/current/static/app-createdb.html[createdb] in the official documentation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== Accessing Database

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Use psql sparkdb to access the database.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ psql sparkdb\npsql (9.6.2)\nType \"help\" for help.\n\nsparkdb=#\n

Execute SELECT version() to check the version of the database server you are connected to.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            sparkdb=# SELECT version();\n                                                   version\n--------------------------------------------------------------------------------------------------------------\n PostgreSQL 9.6.2 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit\n(1 row)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Use \\h for help and \\q to leave a session.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== Creating Table

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Create a table using CREATE TABLE command.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CREATE TABLE projects (\n  id SERIAL PRIMARY KEY,\n  name text,\n  website text\n);\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Insert rows to initialize the table with data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            INSERT INTO projects (name, website) VALUES ('Apache Spark', 'http://spark.apache.org');\nINSERT INTO projects (name, website) VALUES ('Apache Hive', 'http://hive.apache.org');\nINSERT INTO projects VALUES (DEFAULT, 'Apache Kafka', 'http://kafka.apache.org');\nINSERT INTO projects VALUES (DEFAULT, 'Apache Flink', 'http://flink.apache.org');\n

Execute select * from projects; to make sure that you have the following records in the projects table:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            sparkdb=# select * from projects;\n id |     name     |         website\n----+--------------+-------------------------\n  1 | Apache Spark | http://spark.apache.org\n  2 | Apache Hive  | http://hive.apache.org\n  3 | Apache Kafka | http://kafka.apache.org\n  4 | Apache Flink | http://flink.apache.org\n(4 rows)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== Dropping Database

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ dropdb sparkdb\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TIP: Consult http://www.postgresql.org/docs/current/static/app-dropdb.html[dropdb] in the official documentation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== Stopping Database Server

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            pg_ctl -D /usr/local/var/postgres stop\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-failing-stage/","title":"Causing Stage to Fail","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Exercise: Causing Stage to Fail

This exercise shows how Spark re-executes a stage when it fails.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Recipe

Start a Spark cluster, e.g. a 1-node Hadoop YARN cluster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            start-yarn.sh\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            // 2-stage job -- it _appears_ that a stage can be failed only when there is a shuffle\nsc.parallelize(0 to 3e3.toInt, 2).map(n => (n % 2, n)).groupByKey.count\n

Use at least 2 executors so that you can kill one and keep the application up and running (on the remaining executor).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn \\\n  -c spark.shuffle.service.enabled=true \\\n  --num-executors 2\n
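
To make a stage fail, you could kill one of the executor JVMs while the job above is running. A possible way to do it (a sketch; the executor class name and the pid placeholder are assumptions and may differ across Spark versions and deployments):

$ jps -lm | grep CoarseGrainedExecutorBackend\n$ kill -9 <pid-from-the-jps-output>\n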
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-pairrddfunctions-oneliners/","title":"One-liners using PairRDDFunctions","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Exercise: One-liners using PairRDDFunctions

This is a set of one-liners to give you an entry point into using rdd:PairRDDFunctions.md[PairRDDFunctions].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Exercise

How would you pair the elements that share the same key and create a new RDD out of the matched values? One possible solution is sketched below the input data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-pairrddfunctions-oneliners/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            val users = Seq((1, \"user1\"), (1, \"user2\"), (2, \"user1\"), (2, \"user3\"), (3,\"user2\"), (3,\"user4\"), (3,\"user1\"))

// Input RDD\nval us = sc.parallelize(users)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            // ...your code here

// Desired output\n// Seq((\"user1\",\"user2\"),(\"user1\",\"user3\"),(\"user1\",\"user4\"),(\"user2\",\"user4\"))
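
One possible solution (a sketch; it assumes the matched pairs should be unique and ordered alphabetically within a pair) joins the input RDD with itself by key:

// a sketch -- self-join by key, then keep every unordered pair exactly once\nval us = sc.parallelize(users)\nval pairs = us.join(us)\n  .values\n  .filter { case (a, b) => a < b }\n  .distinct\npairs.collect\n// contains (user1,user2), (user1,user3), (user1,user4), (user2,user4) in some order\n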

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-standalone-master-ha/","title":"Spark Standalone - Using ZooKeeper for High-Availability of Master","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Spark Standalone - Using ZooKeeper for High-Availability of Master

TIP: Read ../spark-standalone-Master.md#recovery-mode[Recovery Mode] for the theory behind it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You're going to start two standalone Masters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You'll need 4 terminals (adjust addresses as needed):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start ZooKeeper.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Create a configuration file ha.conf with the content as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark.deploy.recoveryMode=ZOOKEEPER\nspark.deploy.zookeeper.url=<zookeeper_host>:2181\nspark.deploy.zookeeper.dir=/spark\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start the first standalone Master.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ./sbin/start-master.sh -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start the second standalone Master.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: It is not possible to start another instance of standalone Master on the same machine using ./sbin/start-master.sh. The reason is that the script assumes one instance per machine only. We're going to change the script to make it possible.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ cp ./sbin/start-master{,-2}.sh\n\n$ grep \"CLASS 1\" ./sbin/start-master-2.sh\n\"$\\{SPARK_HOME}/sbin\"/spark-daemon.sh start $CLASS 1 \\\n\n$ sed -i -e 's/CLASS 1/CLASS 2/' sbin/start-master-2.sh\n\n$ grep \"CLASS 1\" ./sbin/start-master-2.sh\n\n$ grep \"CLASS 2\" ./sbin/start-master-2.sh\n\"$\\{SPARK_HOME}/sbin\"/spark-daemon.sh start $CLASS 2 \\\n\n$ ./sbin/start-master-2.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf\n

You can check how many instances you're currently running using the jps command as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ jps -lm\n5024 sun.tools.jps.Jps -lm\n4994 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf\n4808 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8080 -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf\n4778 org.apache.zookeeper.server.quorum.QuorumPeerMain config/zookeeper.properties\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start a standalone Worker.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ./sbin/start-slave.sh spark://localhost:7077,localhost:17077\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Start Spark shell.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ./bin/spark-shell --master spark://localhost:7077,localhost:17077\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Wait till the Spark shell connects to an active standalone Master.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Find out which standalone Master is active (there can only be one). Kill it. Observe how the other standalone Master takes over and lets the Spark shell register with itself. Check out the master's UI.
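
One way to kill the active Master (a sketch; pick the pid of the active instance from the jps -lm output shown above):

$ jps -lm | grep org.apache.spark.deploy.master.Master\n$ kill -9 <pid-of-the-active-master>\n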

Optionally, kill the worker and watch it disappear from the active master's logs right away.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-exercise-take-multiple-jobs/","title":"Learning Jobs and Partitions Using take Action","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Exercise: Learning Jobs and Partitions Using take Action

This exercise introduces the take action and the use of spark-shell and web UI. It should also introduce you to the concepts of partitions and jobs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The following snippet creates an RDD of 16 elements with 16 partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scala> val r1 = sc.parallelize(0 to 15, 16)\nr1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:18\n\nscala> r1.partitions.size\nres63: Int = 16\n\nscala> r1.foreachPartition(it => println(\">>> partition size: \" + it.size))\n...\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n... // the machine has 8 cores\n... // so first 8 tasks get executed immediately\n... // with the others after a core is free to take on new tasks.\n>>> partition size: 1\n...\n>>> partition size: 1\n...\n>>> partition size: 1\n...\n>>> partition size: 1\n>>> partition size: 1\n...\n>>> partition size: 1\n>>> partition size: 1\n>>> partition size: 1\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            All 16 partitions have one element.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When you execute r1.take(1) only one job gets run since it is enough to compute one task on one partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Snapshot from web UI - note the number of tasks

However, when you execute r1.take(2), two jobs get run: the implementation first runs a job over a single partition and, if that does not return the requested number of elements, quadruples the number of partitions to scan in each of the following jobs.
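
You can also verify the number of jobs from spark-shell (a minimal sketch using SparkStatusTracker; it assumes no other jobs run in between and that the shell does not set a job group):

val before = sc.statusTracker.getJobIdsForGroup(null).length\nr1.take(2)\nval after = sc.statusTracker.getJobIdsForGroup(null).length\nprintln(s\"take(2) ran ${after - before} job(s)\")  // expected: 2\n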

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Snapshot from web UI - note the number of tasks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Can you guess how many jobs are run for r1.take(15)? How many tasks per job?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Snapshot from web UI - note the number of tasks

Answer: 3 jobs. The first job scans a single partition (1 element), the second scans 4 more (5 elements in total), which is still fewer than 15, so a third job is needed to scan further partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-first-app/","title":"Your first complete Spark application (using Scala and sbt)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Your first Spark application (using Scala and sbt)

This page gives you the exact steps to develop and run a complete Spark application using the http://www.scala-lang.org/[Scala] programming language and http://www.scala-sbt.org/[sbt] as the build tool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            [TIP] Refer to Quick Start's http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/quick-start.html#self-contained-applications[Self-Contained Applications] in the official documentation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The sample application called SparkMe App is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Overview

You're going to use http://www.scala-sbt.org/[sbt] as the project build tool. It uses build.sbt for the project's description as well as its dependencies, i.e. the version of Apache Spark and other libraries.

The application's main code is under the src/main/scala directory, in the SparkMeApp.scala file.

With the files in place, executing sbt package produces a package (jar) that can be deployed onto a Spark cluster using spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In this example, you're going to use Spark's local/spark-local.md[local mode].
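
The project layout might look as follows (a sketch based on the files described in this and the following sections):

.\n├── build.sbt\n├── project\n│   └── build.properties\n└── src\n    └── main\n        └── scala\n            └── SparkMeApp.scala\n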

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Project's build - build.sbt

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Any Scala project managed by sbt uses build.sbt as the central place for configuration, including project dependencies denoted as libraryDependencies.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            build.sbt

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            name         := \"SparkMe Project\"\nversion      := \"1.0\"\norganization := \"pl.japila\"\n\nscalaVersion := \"2.11.7\"\n\nlibraryDependencies += \"org.apache.spark\" %% \"spark-core\" % \"1.6.0-SNAPSHOT\"  // <1>\nresolvers += Resolver.mavenLocal\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            <1> Use the development version of Spark 1.6.0-SNAPSHOT

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === SparkMe Application

The application takes a single command-line parameter (args(0)): the file to process. It reads the file and prints out the number of lines.

package pl.japila.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkMe Application")
    val sc = new SparkContext(conf)

    val fileName = args(0)
    val lines = sc.textFile(fileName).cache

    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === sbt version - project/build.properties

The sbt launcher uses the project/build.properties file to set up the actual sbt to use.

sbt.version=0.13.9

TIP: With this file in place the build is more predictable, as the version of sbt no longer depends on the sbt launcher.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Packaging Application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Execute sbt package to package the application.

➜  sparkme-app  sbt package
[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins
[info] Loading project definition from /Users/jacek/dev/sandbox/sparkme-app/project
[info] Set current project to SparkMe Project (in build file:/Users/jacek/dev/sandbox/sparkme-app/)
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/classes...
[info] Packaging /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/sparkme-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 3 s, completed Sep 23, 2015 12:47:52 AM

The application uses only classes that come with Spark, so sbt package is enough.

The final application, ready for deployment, is target/scala-2.11/sparkme-project_2.11-1.0.jar.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === Submitting Application to Spark (local)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: The application is going to be deployed to local[*]. Change it to whatever cluster you have available (refer to spark-cluster.md[Running Spark in cluster]).

Use spark-submit to run the SparkMe application and specify the file to process (the application's only, and required, input parameter), e.g. the project's own build.sbt.

NOTE: build.sbt is sbt's build definition and is used here only as an input file for demonstration purposes. Any file will work fine.

➜  sparkme-app  ~/dev/oss/spark/bin/spark-submit --master "local[*]" --class pl.japila.spark.SparkMeApp target/scala-2.11/sparkme-project_2.11-1.0.jar build.sbt
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
15/09/23 01:06:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/23 01:06:04 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
There are 8 lines in build.sbt

NOTE: Disregard the two WARN log messages above.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You're done. Sincere congratulations!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-hello-world-using-spark-shell/","title":"Spark's Hello World using Spark shell and Scala","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Exercise: Spark's Hello World using Spark shell and Scala

Run Spark shell and count the number of words in a file using the MapReduce pattern (a minimal sketch follows the steps below).

• Use sc.textFile to read the file in as an RDD of lines
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Use RDD.flatMap for a mapper step
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Use reduceByKey for a reducer step
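A minimal sketch of the exercise in spark-shell could look as follows (README.md is only a placeholder for any text file you have at hand):

// spark-shell provides sc, the SparkContext
val lines  = sc.textFile("README.md")              // RDD of lines
val words  = lines.flatMap(_.split("\\s+"))        // mapper step: split lines into words
val pairs  = words.map(word => (word, 1))          // emit (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)              // reducer step: sum the counts per word
counts.take(10).foreach(println)                   // print a sample of the result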
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-sql-hive-orc-example/","title":"Using Spark SQL to update data in Hive using ORC files","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == Using Spark SQL to update data in Hive using ORC files

This example showed up on Spark's users mailing list.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"exercises/spark-sql-hive-orc-example/#caution","title":"[CAUTION]","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • FIXME Offer a complete working solution in Scala
• FIXME Load ORC files into a DataFrame, e.g. val df = hiveContext.read.format("orc").load(to/path)

The solution was to use Hive in ORC format with partitions:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • A table in Hive stored as an ORC file (using partitioning)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Using SQLContext.sql to insert data into the table
• Using SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge the many small files into larger files optimized for your HDFS block size (see the sketch after this list)
  • Since the CONCATENATE command operates on the files in place, it is transparent to any downstream processing
• The Hive solution is just to concatenate the files
  • It does not alter or change records
  • It is possible to update data in Hive using the ORC format
  • With transactional tables in Hive (together with insert, update and delete) the "concatenate" is done for you automatically at regular intervals; currently this works only with tables stored as ORC
  • Alternatively, use HBase with Phoenix as the SQL layer on top
  • Hive was originally not designed for updates, as it was purely warehouse-focused, but the most recent versions can do updates, deletes etc. in a transactional way
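The following is a minimal, hedged sketch of the approach in Spark 1.x-style Scala; the table names events and staging_events and the partition column dt are hypothetical:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// A partitioned Hive table stored as ORC (all names are placeholders)
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS events (id STRING, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS ORC""")

// Use SQL to insert data into a partition of the table
sqlContext.sql(
  "INSERT INTO TABLE events PARTITION (dt = '2015-09-23') " +
  "SELECT id, payload FROM staging_events")

// Periodically merge the partition's many small files in place
sqlContext.sql("ALTER TABLE events PARTITION (dt = '2015-09-23') CONCATENATE")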

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Criteria:

• spark-streaming/spark-streaming.md[Spark Streaming] jobs are receiving a lot of small events (~10 kB on average); see the sketch after this list for the kind of job that produces them
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Events are stored to HDFS, e.g. for Pig jobs
• There are a lot of small files in HDFS (several million)
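A minimal sketch of such a job (host, port, batch interval and output path are all placeholders) could be:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// A hypothetical stream of small events
val events = ssc.socketTextStream("localhost", 9999)
// Every batch ends up as a separate set of (typically small) files in HDFS
events.saveAsTextFiles("hdfs:///events/batch")
ssc.start()
ssc.awaitTermination()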
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"external-shuffle-service/","title":"External Shuffle Service","text":"

External Shuffle Service is a Spark service that serves RDD and shuffle blocks from outside of Executors, for Executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExternalShuffleService can be started as a command-line application or automatically as part of a worker node in a Spark cluster (e.g. Spark Standalone).

External Shuffle Service is enabled in a Spark application using the spark.shuffle.service.enabled configuration property.
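For example, the property can be set on the SparkConf of an application (a minimal sketch; the application name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-service-demo")              // placeholder name
  .set("spark.shuffle.service.enabled", "true")    // use the External Shuffle Service
val sc = new SparkContext(conf)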

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"external-shuffle-service/ExecutorShuffleInfo/","title":"ExecutorShuffleInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorShuffleInfo is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"external-shuffle-service/ExternalBlockHandler/","title":"ExternalBlockHandler","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExternalBlockHandler is an RpcHandler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"external-shuffle-service/ExternalBlockHandler/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExternalBlockHandler takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Registered Executors File

ExternalBlockHandler creates the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ShuffleMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OneForOneStreamManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalShuffleBlockResolver

ExternalBlockHandler is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalShuffleService is requested for an ExternalBlockHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • YarnShuffleService is requested to serviceInit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"external-shuffle-service/ExternalBlockHandler/#oneforonestreammanager","title":"OneForOneStreamManager

ExternalBlockHandler can be given a OneForOneStreamManager when created, or creates one itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#externalshuffleblockresolver","title":"ExternalShuffleBlockResolver

ExternalBlockHandler can be given an ExternalShuffleBlockResolver when created, or creates one itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExternalShuffleBlockResolver is used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • registerExecutor when ExternalBlockHandler is requested to handle a RegisterExecutor message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • removeBlocks when ExternalBlockHandler is requested to handle a RemoveBlocks message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • getLocalDirs when ExternalBlockHandler is requested to handle a GetLocalDirsForExecutors message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • applicationRemoved when ExternalBlockHandler is requested to applicationRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • executorRemoved when ExternalBlockHandler is requested to executorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • registerExecutor when ExternalBlockHandler is requested to reregisterExecutor

ExternalShuffleBlockResolver is also used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • getBlockData and getRddBlockData for ManagedBufferIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • getBlockData and getContinuousBlocksData for ShuffleManagedBufferIterator

ExternalShuffleBlockResolver is closed when ExternalBlockHandler is.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#registered-executors-file","title":"Registered Executors File

ExternalBlockHandler can be given a Java File (or null) when created.

This file is used merely to create an ExternalShuffleBlockResolver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#messages","title":"Messages","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#fetchshuffleblocks","title":"FetchShuffleBlocks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Request to read a set of blocks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              \"Posted\" (created) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OneForOneBlockFetcher is requested to createFetchShuffleBlocksMsg

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              When received, ExternalBlockHandler requests the OneForOneStreamManager to registerStream (with a ShuffleManagedBufferIterator).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExternalBlockHandler prints out the following TRACE message to the logs:

Registered streamId [streamId] with [numBlockIds] buffers for client [clientId] from host [remoteAddress]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, ExternalBlockHandler responds with a StreamHandle (of streamId and numBlockIds).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#getlocaldirsforexecutors","title":"GetLocalDirsForExecutors","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#openblocks","title":"OpenBlocks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

OpenBlocks is kept for backward compatibility and is handled like FetchShuffleBlocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#registerexecutor","title":"RegisterExecutor","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#removeblocks","title":"RemoveBlocks","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#shufflemetrics","title":"ShuffleMetrics","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#executor-removed-notification","title":"Executor Removed Notification
void executorRemoved(
  String executorId,
  String appId)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              executorRemoved requests the ExternalShuffleBlockResolver to executorRemoved.

executorRemoved is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalShuffleService is requested to executorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#application-finished-notification","title":"Application Finished Notification
void applicationRemoved(
  String appId,
  boolean cleanupLocalDirs)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              applicationRemoved requests the ExternalShuffleBlockResolver to applicationRemoved.

applicationRemoved is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalShuffleService is requested to applicationRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • YarnShuffleService (Spark on YARN) is requested to stopApplication
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalBlockHandler/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.network.shuffle.ExternalBlockHandler logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.network.shuffle.ExternalBlockHandler=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/","title":"ExternalShuffleBlockResolver","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExternalShuffleBlockResolver manages converting shuffle BlockIds into physical segments of local files (from a process outside of Executors).
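
To make the idea concrete, here is a minimal, illustrative Java sketch of what resolving a shuffle block to a physical file can look like: given an executor's registered local directories, the data file for a (shuffleId, mapId) pair is located under a hashed sub-directory. The file-name pattern and the hashing scheme below are simplifying assumptions for illustration, not Spark's exact layout.

```java
import java.io.File;

// Illustrative only: locate the shuffle data file for (shuffleId, mapId) under
// an executor's registered local directories. The file-name pattern and the
// hashing scheme are simplifying assumptions, not Spark's exact layout.
public class ShuffleFileLookupSketch {

  static File dataFile(String[] localDirs, int subDirsPerLocalDir,
                       int shuffleId, long mapId) {
    String fileName = "shuffle_" + shuffleId + "_" + mapId + "_0.data";
    int hash = Math.abs(fileName.hashCode());
    String localDir = localDirs[hash % localDirs.length];         // pick a local dir
    int subDir = (hash / localDirs.length) % subDirsPerLocalDir;  // pick a sub-directory
    return new File(new File(localDir, String.format("%02x", subDir)), fileName);
  }

  public static void main(String[] args) {
    String[] localDirs = { "/tmp/spark-local-1", "/tmp/spark-local-2" }; // hypothetical dirs
    System.out.println(dataFile(localDirs, 64, 0, 5L));
  }
}
```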

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ExternalShuffleBlockResolver takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • registeredExecutor File (Java's File)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Directory Cleaner

ExternalShuffleBlockResolver is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExternalBlockHandler is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#executors","title":"Executors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleBlockResolver uses a mapping of ExecutorShuffleInfos by AppExecId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleBlockResolver can (re)load this mapping from a registeredExecutor file or simply start from scratch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A new mapping is added when registering an executor.
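
Conceptually, the registry is a concurrent map keyed by an application/executor id pair. The Java sketch below illustrates that idea; AppExecId and ShuffleInfo here are simplified stand-ins rather than Spark's actual classes.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the executors registry: shuffle info keyed by an
// (appId, execId) pair. AppExecId and ShuffleInfo are illustrative stand-ins.
class ExecutorRegistrySketch {

  static final class AppExecId {
    final String appId;
    final String execId;

    AppExecId(String appId, String execId) {
      this.appId = appId;
      this.execId = execId;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof AppExecId)) return false;
      AppExecId that = (AppExecId) o;
      return appId.equals(that.appId) && execId.equals(that.execId);
    }

    @Override public int hashCode() {
      return Objects.hash(appId, execId);
    }
  }

  // Only the local directories matter for this sketch.
  static final class ShuffleInfo {
    final String[] localDirs;
    ShuffleInfo(String[] localDirs) { this.localDirs = localDirs; }
  }

  private final Map<AppExecId, ShuffleInfo> executors = new ConcurrentHashMap<>();

  // A new mapping is added when an executor registers...
  void register(String appId, String execId, ShuffleInfo info) {
    executors.put(new AppExecId(appId, execId), info);
  }

  // ...and looked up later, e.g. when the executor is removed.
  ShuffleInfo lookup(String appId, String execId) {
    return executors.get(new AppExecId(appId, execId));
  }
}
```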

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#directory-cleaner-executor","title":"Directory Cleaner Executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleBlockResolver can be given a Java Executor or use a single worker thread executor (with spark-shuffle-directory-cleaner thread prefix).

The Executor is used to schedule asynchronous deletion of an executor's local directories, or of the non-shuffle and non-RDD files in those directories.
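
A minimal sketch of the single-worker-thread variant, assuming a plain JDK ExecutorService with one named daemon thread; deleteDirs below is a hypothetical placeholder for the actual deletion logic.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch (not Spark's code): one named daemon worker thread that
// deletes an executor's directories asynchronously.
class DirectoryCleanerSketch {

  private final ExecutorService directoryCleaner =
      Executors.newSingleThreadExecutor(runnable -> {
        Thread t = new Thread(runnable, "spark-shuffle-directory-cleaner");
        t.setDaemon(true);
        return t;
      });

  // Hypothetical placeholder for the actual recursive deletion logic.
  private void deleteDirs(String[] dirs) {
    for (String dir : dirs) {
      System.out.println("Deleting " + dir);
    }
  }

  // The clean-up runs on the cleaner thread, so the caller is never blocked.
  void cleanUpAsync(String[] localDirs) {
    directoryCleaner.execute(() -> deleteDirs(localDirs));
  }
}
```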

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#sparkshuffleservicefetchrddenabled","title":"spark.shuffle.service.fetch.rdd.enabled

ExternalShuffleBlockResolver uses the spark.shuffle.service.fetch.rdd.enabled configuration property to control whether or not to remove cached RDD files (alongside shuffle output files).
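
For illustration, the property is ordinarily set in spark-defaults.conf, but it can also be set programmatically on a SparkConf before the SparkContext is created (together with spark.shuffle.service.enabled, which this feature builds on); the application name below is arbitrary.

```java
import org.apache.spark.SparkConf;

// A sketch: these properties are typically set in spark-defaults.conf, but can
// also be set on a SparkConf before the SparkContext is created.
public class FetchRddConfExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("fetch-rdd-demo")                             // arbitrary app name
        .set("spark.shuffle.service.enabled", "true")             // external shuffle service
        .set("spark.shuffle.service.fetch.rdd.enabled", "true");  // also serve cached RDD blocks

    System.out.println(conf.get("spark.shuffle.service.fetch.rdd.enabled")); // true
  }
}
```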

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#registering-executor","title":"Registering Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                void registerExecutor(\n  String appId,\n  String execId,\n  ExecutorShuffleInfo executorInfo)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerExecutor is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExternalBlockHandler is requested to handle a RegisterExecutor message and reregisterExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#cleaning-up-local-directories-for-removed-executor","title":"Cleaning Up Local Directories for Removed Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                void executorRemoved(\n  String executorId,\n  String appId)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                executorRemoved prints out the following INFO message to the logs:

Clean up non-shuffle and non-RDD files associated with the finished executor [executorId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                executorRemoved looks up the executor in the executors internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When found, executorRemoved prints out the following INFO message to the logs and requests the Directory Cleaner Executor to execute asynchronous deletion of the executor's local directories (on a separate thread).

Cleaning up non-shuffle and non-RDD files in executor [AppExecId]'s [localDirs] local dirs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When not found, executorRemoved prints out the following INFO message to the logs:

Executor is not registered (appId=[appId], execId=[executorId])

executorRemoved is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExternalBlockHandler is requested to executorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#deletenonshuffleserviceservedfiles","title":"deleteNonShuffleServiceServedFiles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                void deleteNonShuffleServiceServedFiles(\n  String[] dirs)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                deleteNonShuffleServiceServedFiles creates a Java FilenameFilter for files that meet all of the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. A file name does not end with .index or .data
2. When rddFetchEnabled is enabled, the file name does not start with the rdd_ prefix

deleteNonShuffleServiceServedFiles deletes the files and directories that match the FilenameFilter in every directory given in the input dirs.
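
A sketch of such a filter, assuming only what is described above (shuffle files end with .index or .data, cached RDD block files start with rdd_); this is not Spark's exact code, and the directory path in the usage example is hypothetical.

```java
import java.io.File;
import java.io.FilenameFilter;

// A sketch of the filter described above: accept only files that the shuffle
// service does not serve, i.e. not .index/.data files and (when RDD fetching
// is enabled) not rdd_ block files either. Not Spark's exact code.
public class NonShuffleServiceServedFilesFilter {

  static FilenameFilter filter(boolean rddFetchEnabled) {
    return (File dir, String name) -> {
      boolean isShuffleFile = name.endsWith(".index") || name.endsWith(".data");
      boolean isRddFile = rddFetchEnabled && name.startsWith("rdd_");
      return !isShuffleFile && !isRddFile;
    };
  }

  public static void main(String[] args) {
    // List (rather than delete) the matching entries in a hypothetical local dir.
    File localDir = new File("/tmp/spark-local-dir");
    String[] matches = localDir.list(filter(true));
    if (matches != null) {
      for (String name : matches) {
        System.out.println(name);
      }
    }
  }
}
```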

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                deleteNonShuffleServiceServedFiles prints out the following DEBUG message to the logs:

Successfully cleaned up files not served by shuffle service in directory: [localDir]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In case of any exceptions, deleteNonShuffleServiceServedFiles prints out the following ERROR message to the logs:

Failed to delete files not served by shuffle service in directory: [localDir]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#application-removed-notification","title":"Application Removed Notification
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                void applicationRemoved(\n  String appId,\n  boolean cleanupLocalDirs)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                applicationRemoved...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                applicationRemoved is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExternalBlockHandler is requested to applicationRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#deleteexecutordirs","title":"deleteExecutorDirs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                void deleteExecutorDirs(\n  String[] dirs)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                deleteExecutorDirs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#fetching-block-data","title":"Fetching Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ManagedBuffer getBlockData(\n  String appId,\n  String execId,\n  int shuffleId,\n  long mapId,\n  int reduceId)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getBlockData...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getBlockData is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ManagedBufferIterator is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleManagedBufferIterator is requested for next ManagedBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.network.shuffle.ExternalShuffleBlockResolver logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockResolver=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/","title":"ExternalShuffleService","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleService is a Spark service that can serve RDD and shuffle blocks.

ExternalShuffleService manages shuffle output files so they are available to executors. As the shuffle output files are managed externally to the executors, it offers uninterrupted access to the shuffle output files even when executors are killed or shut down (especially with Dynamic Allocation of Executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleService can be launched from command line.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleService is enabled on the driver and executors using spark.shuffle.service.enabled configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Spark on YARN uses a custom external shuffle service (YarnShuffleService).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"external-shuffle-service/ExternalShuffleService/#launching-externalshuffleservice","title":"Launching ExternalShuffleService

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleService can be launched as a standalone application using spark-class.

spark-class org.apache.spark.deploy.ExternalShuffleService
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#main-entry-point","title":"main Entry Point
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main(\n  args: Array[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main is the entry point of ExternalShuffleService standalone application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main prints out the following INFO message to the logs:

Started daemon with process name: [name]

main registers signal handlers for the TERM, HUP, and INT signals.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main loads the default Spark properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main creates a SecurityManager.

main explicitly sets spark.shuffle.service.enabled to true (since this service is deliberately started from the command line).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main creates an ExternalShuffleService and starts it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main prints out the following DEBUG message to the logs:

Adding shutdown hook

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                main registers a shutdown hook. When triggered, the shutdown hook prints the following INFO message to the logs and requests the ExternalShuffleService to stop.

Shutting down shuffle service.
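
A minimal sketch of the shutdown-hook pattern using the plain JDK API (Spark itself registers the hook through its own shutdown-hook utility); StoppableService below is an illustrative placeholder for the ExternalShuffleService.

```java
// Minimal sketch of the shutdown-hook pattern with the plain JDK API; the
// StoppableService interface stands in for the ExternalShuffleService.
public class ShutdownHookSketch {

  interface StoppableService {
    void stop();
  }

  static void installShutdownHook(StoppableService service) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      // Mirrors the INFO message logged before stopping the service.
      System.out.println("Shutting down shuffle service.");
      service.stop();
    }, "shutdown-hook"));
  }

  public static void main(String[] args) {
    installShutdownHook(() -> System.out.println("service stopped"));
    System.out.println("Adding shutdown hook"); // the hook fires later, at JVM exit
  }
}
```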
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ExternalShuffleService takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SecurityManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService standalone application is started
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is created (and initializes an ExternalShuffleService)
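For illustration, a sketch of creating the service with the two arguments listed above. Both ExternalShuffleService and SecurityManager are internal (non-public) Spark classes, so this mirrors what the standalone Worker and the standalone application do internally rather than user-facing code.

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.deploy.ExternalShuffleService

// Sketch only: mirrors the internal construction described above
val conf = new SparkConf()
val securityManager = new SecurityManager(conf)
val shuffleService = new ExternalShuffleService(conf, securityManager)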
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#transportserver","title":"TransportServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  server: TransportServer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService uses an internal reference to a TransportServer that is created when ExternalShuffleService is started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService uses an ExternalBlockHandler to handle RPC messages (and serve RDD blocks and shuffle blocks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TransportServer is requested to close when ExternalShuffleService is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TransportServer is used for metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#port","title":"Port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService uses spark.shuffle.service.port configuration property for the port to listen to when started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#sparkshuffleserviceenabled","title":"spark.shuffle.service.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService uses spark.shuffle.service.enabled configuration property to control whether or not is enabled (and should be started when requested).
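For example, both properties can be set programmatically on a SparkConf; a minimal sketch (the port value below is arbitrary and only for illustration):

import org.apache.spark.SparkConf

// Example only: enable the external shuffle service and use a non-default port
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7447") // the default is 7337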

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#externalblockhandler","title":"ExternalBlockHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  blockHandler: ExternalBlockHandler\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExternalShuffleService creates an ExternalBlockHandler when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  With spark.shuffle.service.db.enabled and spark.shuffle.service.enabled configuration properties enabled, the ExternalBlockHandler is given a local directory with a registeredExecutors.ldb file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  blockHandler\u00a0is used to create a TransportContext that creates the TransportServer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  blockHandler\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • applicationRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • executorRemoved
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#findregisteredexecutorsdbfile","title":"findRegisteredExecutorsDBFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  findRegisteredExecutorsDBFile(\n  dbName: String): File\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  findRegisteredExecutorsDBFile returns one of the local directories (defined using spark.local.dir configuration property) with the input dbName file or null when no directories defined.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  findRegisteredExecutorsDBFile searches the local directories (defined using spark.local.dir configuration property) for the input dbName file. Unless found, findRegisteredExecutorsDBFile takes the first local directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  With no local directories defined in spark.local.dir configuration property, findRegisteredExecutorsDBFile prints out the following WARN message to the logs and returns null.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  'spark.local.dir' should be set first when we use db in ExternalShuffleService. Note that this only affects standalone mode.\n
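The lookup can be sketched as follows. This is a simplified reconstruction of the behaviour described above (reading spark.local.dir directly from a SparkConf), not Spark's actual implementation.

import java.io.File
import org.apache.spark.SparkConf

// Simplified sketch of the lookup described above; the real method also logs the WARN message
def findRegisteredExecutorsDBFile(conf: SparkConf, dbName: String): File = {
  val localDirs = conf.get("spark.local.dir", "").split(",").filter(_.trim.nonEmpty)
  if (localDirs.isEmpty) {
    null // no local directories defined
  } else {
    localDirs
      .map(dir => new File(dir.trim, dbName))
      .find(_.exists()) // prefer a directory that already contains the db file
      .getOrElse(new File(localDirs.head.trim, dbName)) // otherwise fall back to the first directory
  }
}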
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#starting-externalshuffleservice","title":"Starting ExternalShuffleService
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Starting shuffle service on port [port] (auth enabled = [authEnabled])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start creates a AuthServerBootstrap with authentication enabled (using SecurityManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start creates a TransportContext (with the ExternalBlockHandler) and requests it to create a server (on the port).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService is requested to startIfEnabled and is launched (as a command-line application)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#startifenabled","title":"startIfEnabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  startIfEnabled(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  startIfEnabled starts the external shuffle service if enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  startIfEnabled\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is requested to startExternalShuffleService
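A minimal sketch of the conditional described above. The ShuffleServiceLauncher class name is hypothetical; the real logic lives inside ExternalShuffleService and reuses its own enabled flag and start method.

import org.apache.spark.SparkConf

// Sketch only: names other than the configuration property are made up for illustration
class ShuffleServiceLauncher(conf: SparkConf) {
  private val enabled = conf.getBoolean("spark.shuffle.service.enabled", false)

  def start(): Unit = { /* start the TransportServer, as described above */ }

  def startIfEnabled(): Unit = {
    if (enabled) {
      start()
    }
  }
}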
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#executor-removed-notification","title":"Executor Removed Notification
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  executorRemoved(\n  executorId: String,\n  appId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  executorRemoved requests the ExternalBlockHandler to executorRemoved.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  executorRemoved\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is requested to handleExecutorStateChanged
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#application-finished-notification","title":"Application Finished Notification
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  applicationRemoved(\n  appId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  applicationRemoved requests the ExternalBlockHandler to applicationRemoved (with cleanupLocalDirs flag enabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  applicationRemoved\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is requested to handle WorkDirCleanup message and maybeCleanupApplication
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/ExternalShuffleService/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.deploy.ExternalShuffleService logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.deploy.ExternalShuffleService=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/configuration-properties/","title":"Spark Configuration Properties of External Shuffle Service","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The following are configuration properties of External Shuffle Service.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleservicedbenabled","title":"spark.shuffle.service.db.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Whether to use db in ExternalShuffleService. Note that this only affects standalone mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService is requested for an ExternalBlockHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is requested to handle a WorkDirCleanup message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleserviceenabled","title":"spark.shuffle.service.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Controls whether to use the External Shuffle Service

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalSparkCluster turns this property off explicitly when started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlacklistTracker is requested to updateBlacklistForFetchFailure
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorMonitor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManager is requested to validateSettings
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkEnv utility is requested to create a \"base\" SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService is created and started
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Worker (Spark Standalone) is requested to handle a WorkDirCleanup message or started
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorRunnable (Spark on YARN) is requested to startContainer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleservicefetchrddenabled","title":"spark.shuffle.service.fetch.rdd.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enables ExternalShuffleService for fetching disk persisted RDD blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When enabled with Dynamic Resource Allocation executors having only disk persisted blocks are considered idle after spark.dynamicAllocation.executorIdleTimeout and will be released accordingly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleBlockResolver is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkEnv utility is requested to create a \"base\" SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorMonitor is created
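As an illustration, this property is usually combined with the external shuffle service and Dynamic Resource Allocation; a minimal SparkConf sketch (the timeout value is only an example):

import org.apache.spark.SparkConf

// Example only: serve disk-persisted RDD blocks from the shuffle service so that
// executors holding only such blocks can be released by dynamic allocation
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.fetch.rdd.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")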
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"external-shuffle-service/configuration-properties/#sparkshuffleserviceport","title":"spark.shuffle.service.port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Port of the external shuffle service

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: 7337

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • StorageUtils utility is requested for the port of an external shuffle service
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"features/","title":"Features","text":""},{"location":"history-server/","title":"Spark History Server","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark History Server is the web UI of Spark applications with event log collection enabled (based on spark.eventLog.enabled configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark History Server is an extension of Spark's web UI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark History Server can be started using start-history-server.sh and stopped using stop-history-server.sh shell scripts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark History Server supports custom configuration properties that can be defined using --properties-file [propertiesFile] command-line option. The properties file can have any valid spark.-prefixed Spark property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./sbin/start-history-server.sh --properties-file history.properties\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If not specified explicitly, Spark History Server uses the default configuration file, i.e. spark-defaults.conf.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark History Server can replay events from event log files recorded by EventLoggingListener.
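For illustration, a minimal spark-defaults.conf sketch that points applications and the History Server at the same event log location (the HDFS path is a placeholder):

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode/shared/spark-logs
spark.history.fs.logDirectory    hdfs://namenode/shared/spark-logs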

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"history-server/#start-history-serversh-shell-script","title":"start-history-server.sh Shell Script

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $SPARK_HOME/sbin/start-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to start a Spark History Server instance.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./sbin/start-history-server.sh\nstarting org.apache.spark.deploy.history.HistoryServer, logging to .../spark/logs/spark-jacek-org.apache.spark.deploy.history.HistoryServer-1-japila.out\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, start-history-server.sh script starts org.apache.spark.deploy.history.HistoryServer standalone application (using spark-daemon.sh shell script).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./bin/spark-class org.apache.spark.deploy.history.HistoryServer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Tip

Using the more explicit spark-class approach to start Spark History Server can make execution easier to trace, as the logs are printed out to the standard output and hence to the terminal directly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When started, start-history-server.sh prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Started daemon with process name: [processName]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start-history-server.sh registers signal handlers (using SignalUtils) for TERM, HUP, INT to log their execution:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  RECEIVED SIGNAL [signal]\n

start-history-server.sh initializes security if enabled (based on the spark.history.kerberos.enabled configuration property).
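If Kerberos is in use, the related properties can be placed in the same properties file. A sketch with illustrative principal and keytab values:

spark.history.kerberos.enabled    true
spark.history.kerberos.principal  spark/_HOST@EXAMPLE.COM
spark.history.kerberos.keytab     /etc/security/keytabs/spark.headless.keytab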

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start-history-server.sh creates a SecurityManager.

start-history-server.sh creates an ApplicationHistoryProvider (based on the spark.history.provider configuration property).

In the end, start-history-server.sh creates a HistoryServer and requests it to bind to the port (based on the spark.history.ui.port configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

The host's IP can be specified using the SPARK_LOCAL_IP environment variable (defaults to 0.0.0.0).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start-history-server.sh prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Bound HistoryServer to [host], and started at [webUrl]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start-history-server.sh registers a shutdown hook to call stop on the HistoryServer instance.
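The startup sequence above can be summarized in the following Scala sketch. It is a simplified illustration rather than the actual HistoryServer.main code: the object name StartHistoryServerSketch is made up, the sketch is placed in the org.apache.spark.deploy.history package because the types involved are private[history]/private[spark], and error handling as well as the shutdown hook are omitted.

package org.apache.spark.deploy.history

import org.apache.spark.{SecurityManager, SparkConf}

object StartHistoryServerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val securityManager = new SecurityManager(conf)

    // spark.history.provider defaults to FsHistoryProvider
    val providerName = conf.get(
      "spark.history.provider", classOf[FsHistoryProvider].getName)
    val provider = Class.forName(providerName)
      .getConstructor(classOf[SparkConf])
      .newInstance(conf)
      .asInstanceOf[ApplicationHistoryProvider]

    // spark.history.ui.port defaults to 18080
    val port = conf.getInt("spark.history.ui.port", 18080)

    val server = new HistoryServer(conf, provider, securityManager, port)
    server.bind()  // binds to SPARK_LOCAL_IP (0.0.0.0 by default) and the port
  }
}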

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"history-server/#stop-history-serversh-shell-script","title":"stop-history-server.sh Shell Script

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $SPARK_HOME/sbin/stop-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to stop a running instance of Spark History Server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  $ ./sbin/stop-history-server.sh\nstopping org.apache.spark.deploy.history.HistoryServer\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"history-server/ApplicationCache/","title":"ApplicationCache","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[ApplicationCache]] ApplicationCache

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ApplicationCache is...FIXME

ApplicationCache is created exclusively when HistoryServer is HistoryServer.md#appCache[created].

ApplicationCache uses the https://github.com/google/guava/wiki/Release14[Google Guava 14.0.1] library for the internal appCache registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[internal-registries]] .ApplicationCache's Internal Properties (e.g. Registries, Counters and Flags) [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Name | Description

| appLoader | [[appLoader]] Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html[CacheLoader] with a custom ++https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html#load(K)++[load] that simply relays to loadApplicationEntry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | removalListener | [[removalListener]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | appCache a| [[appCache]] Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/LoadingCache.html[LoadingCache] of CacheKey keys and CacheEntry entries

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when ApplicationCache is requested for the following:

• The cached UI of a Spark application (CacheEntry) given appId and attemptId IDs (get)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • FIXME (other uses)

| metrics | [[metrics]] |===
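Since appCache is a plain Google Guava LoadingCache, the wiring of the collaborators above (appLoader, removalListener and a maximum size) can be illustrated with the following generic Scala sketch. String keys and values stand in for Spark's private CacheKey and CacheEntry types, so this shows the Guava API pattern rather than the actual ApplicationCache code:

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, RemovalListener, RemovalNotification}

// Populates an entry on a cache miss (stands in for loadApplicationEntry).
val appLoader: CacheLoader[String, String] = new CacheLoader[String, String] {
  override def load(key: String): String = s"entry-for-$key"
}

// Reacts to evictions (stands in for detaching the SparkUI of an evicted application).
val removalListener: RemovalListener[String, String] = new RemovalListener[String, String] {
  override def onRemoval(rm: RemovalNotification[String, String]): Unit =
    println(s"evicted ${rm.getKey}")
}

val appCache: LoadingCache[String, String] = CacheBuilder.newBuilder()
  .maximumSize(50)  // cf. retainedApplications
  .removalListener(removalListener)
  .build(appLoader)

appCache.get("app-20240217-0001")  // a cache miss triggers load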

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[creating-instance]] Creating ApplicationCache Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ApplicationCache takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[operations]] ApplicationCacheOperations.md[ApplicationCacheOperations]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[retainedApplications]] retainedApplications
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[clock]] Clock

ApplicationCache initializes the internal registries (appLoader, removalListener, appCache and metrics).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[loadApplicationEntry]] loadApplicationEntry Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationCache/#source-scala","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#loadapplicationentryappid-string-attemptid-optionstring-cacheentry","title":"loadApplicationEntry(appId: String, attemptId: Option[String]): CacheEntry","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    loadApplicationEntry...FIXME

NOTE: loadApplicationEntry is used exclusively when ApplicationCache is requested to load.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[load]] Loading Cached Spark Application UI -- load Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationCache/#source-scala_1","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#loadkey-cachekey-cacheentry","title":"load(key: CacheKey): CacheEntry","text":"

NOTE: load is part of Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html[CacheLoader] to retrieve a CacheEntry, based on a CacheKey, for the appCache registry.

load simply relays to loadApplicationEntry with the appId and attemptId of the input CacheKey.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[get]] Requesting Cached UI of Spark Application (CacheEntry) -- get Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationCache/#source-scala_2","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#getappid-string-attemptid-optionstring-none-cacheentry","title":"get(appId: String, attemptId: Option[String] = None): CacheEntry","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    get...FIXME

NOTE: get is used exclusively when ApplicationCache is requested to withSparkUI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[withSparkUI]] Executing Closure While Holding Application's UI Read Lock -- withSparkUI Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationCache/#source-scala_3","title":"[source, scala]","text":""},{"location":"history-server/ApplicationCache/#withsparkuitfn-sparkui-t-t","title":"withSparkUIT(fn: SparkUI => T): T","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    withSparkUI...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NOTE: withSparkUI is used when HistoryServer is requested to HistoryServer.md#withSparkUI[withSparkUI] and HistoryServer.md#loadAppUi[loadAppUi].
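withSparkUI follows the general "execute a closure while holding a read lock" pattern. The following is a generic Scala sketch of that pattern only (not the actual ApplicationCache code, which guards the SparkUI of a cached application):

import java.util.concurrent.locks.ReentrantReadWriteLock

// Runs fn under the read lock and always releases the lock, even when fn throws.
def withReadLock[T](lock: ReentrantReadWriteLock)(fn: => T): T = {
  lock.readLock().lock()
  try fn
  finally lock.readLock().unlock()
}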

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationCacheOperations/","title":"ApplicationCacheOperations","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[ApplicationCacheOperations]] ApplicationCacheOperations

ApplicationCacheOperations is the contract of...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    [[contract]] [source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    package org.apache.spark.deploy.history

trait ApplicationCacheOperations {
  // only required methods that have no implementation
  // the others follow
  def getAppUI(appId: String, attemptId: Option[String]): Option[LoadedAppUI]
  def attachSparkUI(
    appId: String,
    attemptId: Option[String],
    ui: SparkUI,
    completed: Boolean): Unit
  def detachSparkUI(appId: String, attemptId: Option[String], ui: SparkUI): Unit
}

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NOTE: ApplicationCacheOperations is a private[history] contract.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    .(Subset of) ApplicationCacheOperations Contract [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | getAppUI | [[getAppUI]] spark-webui-SparkUI.md[SparkUI] (the UI of a Spark application)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used exclusively when ApplicationCache is requested for ApplicationCache.md#loadApplicationEntry[loadApplicationEntry]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | attachSparkUI | [[attachSparkUI]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | detachSparkUI | [[detachSparkUI]] |===

[[implementations]] NOTE: HistoryServer.md[HistoryServer] is the one and only known implementation of the ApplicationCacheOperations contract in Apache Spark."},{"location":"history-server/ApplicationHistoryProvider/","title":"ApplicationHistoryProvider","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ApplicationHistoryProvider is an abstraction of history providers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/ApplicationHistoryProvider/#contract","title":"Contract","text":""},{"location":"history-server/ApplicationHistoryProvider/#getapplicationinfo","title":"getApplicationInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getApplicationInfo(\n  appId: String): Option[ApplicationInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"history-server/ApplicationHistoryProvider/#getappui","title":"getAppUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

SparkUI of a given Spark application (by appId and an optional attemptId)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when HistoryServer is requested for the UI of a Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"history-server/ApplicationHistoryProvider/#getlisting","title":"getListing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getListing(): Iterator[ApplicationInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"history-server/ApplicationHistoryProvider/#onuidetached","title":"onUIDetached
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    onUIDetached(\n  appId: String,\n  attemptId: Option[String],\n  ui: SparkUI): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"history-server/ApplicationHistoryProvider/#writeeventlogs","title":"writeEventLogs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    writeEventLogs(\n  appId: String,\n  attemptId: Option[String],\n  zipStream: ZipOutputStream): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Writes events to a stream

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"history-server/ApplicationHistoryProvider/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • FsHistoryProvider
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/EventLogFileWriter/","title":"EventLogFileWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    EventLogFileWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/EventLoggingListener/","title":"EventLoggingListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    EventLoggingListener is a SparkListener that writes out JSON-encoded events of a Spark application with event logging enabled (based on spark.eventLog.enabled configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    EventLoggingListener supports custom configuration properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    EventLoggingListener writes out log files to a directory (based on spark.eventLog.dir configuration property).
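
For illustration, event logging can be switched on programmatically through SparkConf before a SparkContext is created; this is only a minimal sketch, and the log directory below is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: enable event logging so that SparkContext registers an EventLoggingListener.
// The log directory is a placeholder; any location readable by Spark History Server works.
val conf = new SparkConf()
  .setAppName("event-logging-demo")
  .setMaster("local[*]")
  .set("spark.eventLog.enabled", "true")          // turn event logging on
  .set("spark.eventLog.dir", "/tmp/spark-events") // where the JSON-encoded event log is written

val sc = SparkContext.getOrCreate(conf)
// ... run jobs ...
sc.stop() // the in-flight (.inprogress) event log is finalized when the application stops

In real deployments the same two properties are typically set once in spark-defaults.conf rather than in application code.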

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"history-server/EventLoggingListener/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    EventLoggingListener takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Application ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Application Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Log Directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Hadoop Configuration

EventLoggingListener is created when SparkContext is created (with spark.eventLog.enabled enabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"history-server/EventLoggingListener/#eventlogfilewriter","title":"EventLogFileWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      logWriter: EventLogFileWriter\n

EventLoggingListener creates an EventLogFileWriter when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      All arguments to create an EventLoggingListener are passed to the EventLogFileWriter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The EventLogFileWriter is started when EventLoggingListener is started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The EventLogFileWriter is stopped when EventLoggingListener is stopped.

The EventLogFileWriter is requested to writeEvent when EventLoggingListener is requested to start and to log an event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#starting-eventlogginglistener","title":"Starting EventLoggingListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      start(): Unit\n

start requests the EventLogFileWriter to start and then calls initEventLog.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#initeventlog","title":"initEventLog
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      initEventLog(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      initEventLog...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#logging-event","title":"Logging Event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      logEvent(\n  event: SparkListenerEvent,\n  flushLogger: Boolean = false): Unit\n

logEvent persists the given SparkListenerEvent in JSON format: it converts the event to JSON and requests the EventLogFileWriter to write it out.
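
A simplified sketch of that flow is shown below; SimpleEventWriter and toJsonString are hypothetical stand-ins for Spark's internal EventLogFileWriter and JsonProtocol-based serialization, used only to illustrate the two steps:

import org.apache.spark.scheduler.SparkListenerEvent

// Illustrative sketch only: SimpleEventWriter and toJsonString are stand-ins,
// not the real EventLogFileWriter or JsonProtocol.
trait SimpleEventWriter {
  def writeEvent(eventJson: String, flushLogger: Boolean): Unit
}

def toJsonString(event: SparkListenerEvent): String =
  s"""{"Event":"${event.getClass.getSimpleName}"}""" // placeholder serialization

def logEvent(
    writer: SimpleEventWriter,
    event: SparkListenerEvent,
    flushLogger: Boolean = false): Unit = {
  val eventJson = toJsonString(event)       // 1. convert the event to a JSON line
  writer.writeEvent(eventJson, flushLogger) // 2. hand the JSON line over to the writer
}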

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#stopping-eventlogginglistener","title":"Stopping EventLoggingListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      stop requests the EventLogFileWriter to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      stop is used when SparkContext is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#inprogress-file-extension","title":"inprogress File Extension

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      EventLoggingListener uses .inprogress file extension for in-flight event log files of active Spark applications.
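
For example, a running application's in-flight event log can be told apart from a completed one by its name alone; the file names below are made up for illustration:

// A tiny sketch: filtering in-flight event logs by the .inprogress extension.
val eventLogs = Seq(
  "app-20240217185125-0001.inprogress", // still being written (application is running)
  "app-20240217181212-0000")            // completed application

val (inFlight, completed) = eventLogs.partition(_.endsWith(".inprogress"))
// inFlight:  List(app-20240217185125-0001.inprogress)
// completed: List(app-20240217181212-0000)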

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/EventLoggingListener/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.scheduler.EventLoggingListener logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.scheduler.EventLoggingListener=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"history-server/FsHistoryProvider/","title":"FsHistoryProvider","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      FsHistoryProvider is the default ApplicationHistoryProvider for Spark History Server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"history-server/FsHistoryProvider/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      FsHistoryProvider takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Clock (default: SystemClock)

FsHistoryProvider is created when the HistoryServer standalone application is started (and no spark.history.provider configuration property is defined).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"history-server/FsHistoryProvider/#path-of-application-history-cache","title":"Path of Application History Cache
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        storePath: Option[File]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        FsHistoryProvider uses spark.history.store.path configuration property for the directory to cache application history.

With storePath defined, FsHistoryProvider uses a LevelDB as the KVStore. Otherwise, an InMemoryStore is used.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        With storePath defined, FsHistoryProvider uses a HistoryServerDiskManager as the disk manager.
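
The decision can be sketched as follows; this is a simplified illustration of the behaviour described above, not the actual FsHistoryProvider code:

import java.io.File
import org.apache.spark.SparkConf

// Simplified sketch: whether a disk-backed store (and a disk manager) is used
// depends solely on spark.history.store.path being set.
val conf = new SparkConf()
val storePath: Option[File] =
  conf.getOption("spark.history.store.path").map(new File(_))

val storeKind = storePath match {
  case Some(dir) => s"disk-backed KVStore (LevelDB) under $dir, with a HistoryServerDiskManager"
  case None      => "InMemoryStore, no disk manager"
}
println(storeKind)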

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#disk-manager","title":"Disk Manager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        diskManager: Option[HistoryServerDiskManager]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        FsHistoryProvider creates a HistoryServerDiskManager when created (with storePath defined based on spark.history.store.path configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        FsHistoryProvider uses the HistoryServerDiskManager for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • startPolling
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • getAppUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • onUIDetached
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • cleanAppData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#sparkui-of-spark-application","title":"SparkUI of Spark Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getAppUI is part of the ApplicationHistoryProvider abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getAppUI...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#onuidetached","title":"onUIDetached
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        onUIDetached(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        onUIDetached is part of the ApplicationHistoryProvider abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        onUIDetached...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#loaddiskstore","title":"loadDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        loadDiskStore(\n  dm: HistoryServerDiskManager,\n  appId: String,\n  attempt: AttemptInfoWrapper): KVStore\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        loadDiskStore...FIXME

loadDiskStore is used in getAppUI (when a HistoryServerDiskManager is available).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#createinmemorystore","title":"createInMemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createInMemoryStore(\n  attempt: AttemptInfoWrapper): KVStore\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createInMemoryStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createInMemoryStore is used in getAppUI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#rebuildappstore","title":"rebuildAppStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rebuildAppStore(\n  store: KVStore,\n  reader: EventLogFileReader,\n  lastUpdated: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rebuildAppStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rebuildAppStore is used in loadDiskStore and createInMemoryStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#cleanappdata","title":"cleanAppData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanAppData(\n  appId: String,\n  attemptId: Option[String],\n  logPath: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanAppData...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanAppData is used in checkForLogs and deleteAttemptLogs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#polling-for-logs","title":"Polling for Logs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        startPolling(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        startPolling...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        startPolling is used in initialize and startSafeModeCheckThread.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#checking-available-event-logs","title":"Checking Available Event Logs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        checkForLogs(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        checkForLogs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/FsHistoryProvider/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.deploy.history.FsHistoryProvider logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.deploy.history.FsHistoryProvider=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"history-server/HistoryAppStatusStore/","title":"HistoryAppStatusStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        HistoryAppStatusStore is an AppStatusStore for SparkUIs in Spark History Server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"history-server/HistoryAppStatusStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        HistoryAppStatusStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KVStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          HistoryAppStatusStore is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • FsHistoryProvider is requested for a SparkUI (of a Spark application)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"history-server/HistoryAppStatusStore/#executorlogurlhandler","title":"ExecutorLogUrlHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          logUrlHandler: ExecutorLogUrlHandler\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          HistoryAppStatusStore creates an ExecutorLogUrlHandler (for the logUrlPattern) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          HistoryAppStatusStore uses it when requested to replaceLogUrls.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"history-server/HistoryAppStatusStore/#executorlist","title":"executorList
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorList(\n  exec: v1.ExecutorSummary,\n  urlPattern: String): v1.ExecutorSummary\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorList...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorList\u00a0is part of the AppStatusStore abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"history-server/HistoryAppStatusStore/#executorsummary","title":"executorSummary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorSummary(\n  executorId: String): v1.ExecutorSummary\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorSummary...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          executorSummary\u00a0is part of the AppStatusStore abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"history-server/HistoryAppStatusStore/#replacelogurls","title":"replaceLogUrls
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          replaceLogUrls(\n  exec: v1.ExecutorSummary,\n  urlPattern: String): v1.ExecutorSummary\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          replaceLogUrls...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          replaceLogUrls\u00a0is used when HistoryAppStatusStore is requested to executorList and executorSummary.
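
replaceLogUrls is not described in detail here yet (see the FIXME above), but the general idea can be illustrated with a purely conceptual sketch of pattern-based URL rewriting: {{TOKEN}} placeholders in a log URL pattern are substituted with per-executor attributes. The applyPattern helper, the token names and the URL below are made up for illustration and are not Spark's ExecutorLogUrlHandler API.

// Conceptual illustration only (not Spark's implementation): rewrite an executor log URL
// by substituting {{TOKEN}} placeholders in a URL pattern with per-executor attributes.
def applyPattern(urlPattern: String, attributes: Map[String, String]): String =
  attributes.foldLeft(urlPattern) { case (url, (token, value)) =>
    url.replace(s"{{$token}}", value)
  }

// Hypothetical usage
val pattern   = "https://logs.example.com/{{APP_ID}}/{{EXECUTOR_ID}}/{{FILE_NAME}}"
val rewritten = applyPattern(pattern,
  Map("APP_ID" -> "app-123", "EXECUTOR_ID" -> "1", "FILE_NAME" -> "stdout"))
// rewritten == "https://logs.example.com/app-123/1/stdout"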

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"history-server/HistoryServer/","title":"HistoryServer","text":"

HistoryServer is an extension of the web UI (WebUI) for reviewing the event logs of running (active) and completed Spark applications, provided event log collection is enabled (based on the spark.eventLog.enabled configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"history-server/HistoryServer/#starting-historyserver-standalone-application","title":"Starting HistoryServer Standalone Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main(\n  argStrings: Array[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main creates a HistoryServerArguments (with the given argStrings arguments).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main initializes security.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main creates an ApplicationHistoryProvider (based on spark.history.provider configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main creates a HistoryServer (with the ApplicationHistoryProvider and spark.history.ui.port configuration property) and requests it to bind.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          main requests the ApplicationHistoryProvider to start.

main registers a shutdown hook that requests the HistoryServer to stop, and then puts the main thread to sleep indefinitely (leaving the web UI and the history provider running on daemon threads).
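
The steps above can be summarized with the following condensed sketch. It is not the actual Spark source: the default values (FsHistoryProvider, port 18080), the reflective instantiation via Class.forName and the sys.addShutdownHook call are simplifications, and the classes involved are package-private in Spark.

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.deploy.history.{ApplicationHistoryProvider, FsHistoryProvider, HistoryServer, HistoryServerArguments}

// Condensed sketch of the flow described above (simplified, not the actual source)
def main(argStrings: Array[String]): Unit = {
  val conf = new SparkConf
  new HistoryServerArguments(conf, argStrings)          // parse command-line arguments into conf
  // (security initialization omitted) instantiate the configured history provider
  val providerClass = conf.get("spark.history.provider", classOf[FsHistoryProvider].getName)
  val provider = Class.forName(providerClass)
    .getConstructor(classOf[SparkConf])
    .newInstance(conf)
    .asInstanceOf[ApplicationHistoryProvider]
  val port = conf.get("spark.history.ui.port", "18080").toInt
  val server = new HistoryServer(conf, provider, new SecurityManager(conf), port)
  server.bind()       // bind the web UI
  provider.start()    // start serving application history
  sys.addShutdownHook { server.stop() }                 // stop the server on JVM shutdown
  while (true) Thread.sleep(Long.MaxValue)              // keep the main thread alive
}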

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"history-server/HistoryServer/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          HistoryServer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ApplicationHistoryProvider
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Port number

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When created, HistoryServer initializes itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            HistoryServer is created\u00a0when HistoryServer standalone application is started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#applicationcacheoperations","title":"ApplicationCacheOperations

HistoryServer is an ApplicationCacheOperations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#uiroot","title":"UIRoot

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            HistoryServer is a UIRoot.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#initializing-historyserver","title":"Initializing HistoryServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            initialize is part of the WebUI abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            initialize...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#attaching-sparkui","title":"Attaching SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            attachSparkUI(\n  appId: String,\n  attemptId: Option[String],\n  ui: SparkUI,\n  completed: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            attachSparkUI is part of the ApplicationCacheOperations abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            attachSparkUI...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#spark-ui","title":"Spark UI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getAppUI(\n  appId: String,\n  attemptId: Option[String]): Option[LoadedAppUI]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getAppUI is part of the ApplicationCacheOperations abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getAppUI requests the ApplicationHistoryProvider for the Spark UI of a Spark application (based on the appId and attemptId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServer/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.deploy.history.HistoryServer logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.deploy.history.HistoryServer=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"history-server/HistoryServerArguments/","title":"HistoryServerArguments","text":"

HistoryServerArguments is the command-line parser for the History Server.

When HistoryServerArguments is executed with a single command-line parameter, that parameter is assumed to be the event logs directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ ./sbin/start-history-server.sh /tmp/spark-events\n

This is, however, deprecated since Spark 1.1.0, and you should see the following WARN message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.\n

The same WARN message shows up for the --dir and -d command-line options.

The --properties-file [propertiesFile] command-line option specifies the file with custom Spark properties.

NOTE: When not specified explicitly, History Server uses the default configuration file, i.e. spark-defaults.conf.
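
For example, instead of the deprecated command-line parameter shown above, the log directory can be set through spark.history.fs.logDirectory in conf/spark-defaults.conf (the value below reuses the /tmp/spark-events directory from the earlier example):

# conf/spark-defaults.conf
spark.history.fs.logDirectory  /tmp/spark-events

The History Server can then be started without arguments, or with a custom properties file (/path/to/custom.conf is just a placeholder):

$ ./sbin/start-history-server.sh
$ ./sbin/start-history-server.sh --properties-file /path/to/custom.conf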

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"history-server/HistoryServerArguments/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable WARN logging level for org.apache.spark.deploy.history.HistoryServerArguments logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.deploy.history.HistoryServerArguments=WARN\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"history-server/HistoryServerArguments/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":""},{"location":"history-server/HistoryServerDiskManager/","title":"HistoryServerDiskManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            HistoryServerDiskManager is a disk manager for FsHistoryProvider.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"history-server/HistoryServerDiskManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            HistoryServerDiskManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Path
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • KVStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Clock

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              HistoryServerDiskManager is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • FsHistoryProvider is created (and initializes a diskManager)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/HistoryServerDiskManager/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              initialize\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • FsHistoryProvider is requested to startPolling
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/HistoryServerDiskManager/#releasing-application-store","title":"Releasing Application Store
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              release(\n  appId: String,\n  attemptId: Option[String],\n  delete: Boolean = false): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              release...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              release\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • FsHistoryProvider is requested to onUIDetached, cleanAppData and loadDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/JsonProtocol/","title":"JsonProtocol Utility","text":"

JsonProtocol is a utility to convert SparkListenerEvents to and from JSON format.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/JsonProtocol/#objectmapper","title":"ObjectMapper

JsonProtocol uses a Jackson Databind ObjectMapper to perform conversions to and from JSON.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/JsonProtocol/#converting-spark-event-to-json","title":"Converting Spark Event to JSON
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventToJson(\n  event: SparkListenerEvent): JValue\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventToJson converts the given SparkListenerEvent to JSON format.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventToJson\u00a0is used when...FIXME
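Below is a minimal sketch of the conversion. It assumes the code lives in an org.apache.spark package (JsonProtocol is private[spark]) and a Spark version in which sparkEventToJson returns a json4s JValue; the event and the timestamp are made up for illustration.

import org.apache.spark.scheduler.SparkListenerApplicationEnd
import org.apache.spark.util.JsonProtocol
import org.json4s.jackson.JsonMethods.{compact, render}

// Encode a simple event to its json4s AST and print the compact JSON string
val event = SparkListenerApplicationEnd(time = 1708192285000L)
val json  = JsonProtocol.sparkEventToJson(event)
println(compact(render(json)))
// e.g. {"Event":"SparkListenerApplicationEnd","Timestamp":1708192285000}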

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/JsonProtocol/#converting-json-to-spark-event","title":"Converting JSON to Spark Event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventFromJson(\n  json: JValue): SparkListenerEvent\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventFromJson converts a JSON-encoded event to a SparkListenerEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sparkEventFromJson\u00a0is used when...FIXME
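And the reverse direction, again only a sketch under the same assumptions (private[spark] access, json4s-based signature); the sample line mimics what an event log file contains, one JSON document per line.

import org.apache.spark.util.JsonProtocol
import org.json4s.jackson.JsonMethods.parse

// One line of an event log file (shortened for illustration)
val line = """{"Event":"SparkListenerApplicationEnd","Timestamp":1708192285000}"""

// Parse the line into a json4s AST and turn it back into a SparkListenerEvent
val event = JsonProtocol.sparkEventFromJson(parse(line))
// event: SparkListenerEvent = SparkListenerApplicationEnd(1708192285000)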

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/ReplayListenerBus/","title":"ReplayListenerBus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ReplayListenerBus is a SparkListenerBus that can replay JSON-encoded SparkListenerEvent events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ReplayListenerBus is used by FsHistoryProvider.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/ReplayListenerBus/#replaying-json-encoded-sparklistenerevents","title":"Replaying JSON-encoded SparkListenerEvents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              replay(\n  logData: InputStream,\n  sourceName: String,\n  maybeTruncated: Boolean = false): Unit\n

replay reads JSON-encoded SparkListenerEvent events from logData (one event per line) and posts them to all registered SparkListenerInterfaces.

replay uses JsonProtocol (sparkEventFromJson) to convert JSON-encoded events to SparkListenerEvent objects.

NOTE: replay uses the jackson-backed json4s library (http://json4s.org/) to parse each line into a JSON AST.

When parsing a JSON event fails, you may see the following WARN message in the logs (if the failure is on the last line) or a JsonParseException is thrown.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              WARN Got JsonParseException from log file $sourceName at line [lineNumber], the file might not have finished writing cleanly.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Any other non-IO exceptions end up with the following ERROR messages in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ERROR Exception parsing Spark event log: [sourceName]\nERROR Malformed line #[lineNumber]: [currentLine]\n

NOTE: The sourceName input argument is used only in log messages.
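As a rough sketch of how a replay is driven (ReplayListenerBus is private[spark], so this only compiles inside an org.apache.spark package; the file path and the StatsReportListener are illustrative, and the log is assumed to be plain, uncompressed JSON):

import java.io.FileInputStream
import org.apache.spark.scheduler.{ReplayListenerBus, StatsReportListener}

val bus = new ReplayListenerBus()
bus.addListener(new StatsReportListener)   // any SparkListenerInterface works

// Replay an (uncompressed) event log file, one JSON-encoded event per line
val in = new FileInputStream("/tmp/spark-events/app-20240217180000-0001")
try {
  bus.replay(in, sourceName = "app-20240217180000-0001")
} finally {
  in.close()
}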

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/SQLHistoryListener/","title":"SQLHistoryListener","text":"


SQLHistoryListener is a custom SQLListener for the History Server. It attaches the SQL tab to the History Server's web UI only when the first SparkListenerSQLExecutionStart event arrives and turns executor-metrics updates off (see onExecutorMetricsUpdate below). It also handles onTaskEnd events.

NOTE: Support for SQL UI in the History Server was added in SPARK-11206 (Support SQL UI on the history server).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME Add the link to the JIRA.


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/SQLHistoryListener/#source-scala","title":"[source, scala]","text":""},{"location":"history-server/SQLHistoryListener/#onothereventevent-sparklistenerevent-unit","title":"onOtherEvent(event: SparkListenerEvent): Unit","text":"

When a SparkListenerSQLExecutionStart event arrives, onOtherEvent attaches the SQL tab to the web UI and passes the call on to the parent SQLListener.

onTaskEnd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME

Creating SQLHistoryListener Instance

SQLHistoryListener is created using the (private[sql]) SQLHistoryListenerFactory class (which is a SparkHistoryListenerFactory).

The SQLHistoryListenerFactory class is registered as a Java service in META-INF/services/org.apache.spark.scheduler.SparkHistoryListenerFactory and is loaded when SparkUI creates a web UI for the History Server (createHistoryUI):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              org.apache.spark.sql.execution.ui.SQLHistoryListenerFactory\n

NOTE: Loading the service uses Java's ServiceLoader.load method (https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html#load-java.lang.Class-java.lang.ClassLoader-).
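A sketch of that discovery mechanism. SparkHistoryListenerFactory is private[spark], so this is illustrative only; it assumes a Spark version that still ships the trait and that spark-sql is on the classpath.

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.scheduler.SparkHistoryListenerFactory

// Discover every factory registered under
// META-INF/services/org.apache.spark.scheduler.SparkHistoryListenerFactory
val factories = ServiceLoader.load(
  classOf[SparkHistoryListenerFactory],
  Thread.currentThread().getContextClassLoader).asScala

factories.foreach(f => println(f.getClass.getName))
// org.apache.spark.sql.execution.ui.SQLHistoryListenerFactory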

onExecutorMetricsUpdate

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onExecutorMetricsUpdate does nothing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/configuration-properties/","title":"Configuration Properties","text":"

The following are the configuration properties of EventLoggingListener and HistoryServer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"history-server/configuration-properties/#sparkeventlog","title":"spark.eventLog","text":""},{"location":"history-server/configuration-properties/#bufferkb","title":"buffer.kb

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.buffer.kb

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Size of the buffer to use when writing to output streams

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: 100k

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#compress","title":"compress

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.compress

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enables event compression (using a CompressionCodec)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#compressioncodec","title":"compression.codec

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.compression.codec

The codec to compress the event log with (used when spark.eventLog.compress is enabled). By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also use a fully-qualified class name to specify the codec.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: zstd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#dir","title":"dir

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.dir

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Directory where Spark events are logged to (e.g. hdfs://namenode:8021/directory)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: /tmp/spark-events

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The directory must exist before SparkContext can be created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#enabled","title":"enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enables persisting Spark events

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false
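The event logging properties above are typically set on the SparkConf of the application whose events should be recorded. A minimal sketch; the local master, application name, and /tmp/spark-events directory are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Record this application's events so the History Server can replay them later
val conf = new SparkConf()
  .setAppName("event-log-demo")
  .setMaster("local[*]")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")   // the directory must already exist
  .set("spark.eventLog.compress", "true")
val sc = new SparkContext(conf)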

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#erasurecodingenabled","title":"erasureCoding.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.erasureCoding.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#gcmetricsyounggenerationgarbagecollectors","title":"gcMetrics.youngGenerationGarbageCollectors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.gcMetrics.youngGenerationGarbageCollectors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Names of supported young generation garbage collectors. A name usually is the output of GarbageCollectorMXBean.getName.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: Copy, PS Scavenge, ParNew, G1 Young Generation (the built-in young generation garbage collectors)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#gcmetricsoldgenerationgarbagecollectors","title":"gcMetrics.oldGenerationGarbageCollectors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.gcMetrics.oldGenerationGarbageCollectors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Names of supported old generation garbage collectors. A name usually is the output of GarbageCollectorMXBean.getName.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: MarkSweepCompact, PS MarkSweep, ConcurrentMarkSweep, G1 Old Generation (the built-in old generation garbage collectors)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#logblockupdatesenabled","title":"logBlockUpdates.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.logBlockUpdates.enabled

Enables logging RDD block updates (using EventLoggingListener)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#logstageexecutormetrics","title":"logStageExecutorMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.logStageExecutorMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enables logging of per-stage peaks of executor metrics (for each executor) to the event log

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#longformenabled","title":"longForm.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.longForm.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#overwrite","title":"overwrite

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.overwrite

Enables deleting (or at least overwriting) existing .inprogress event log files

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#rollingenabled","title":"rolling.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.rolling.enabled

Enables rolling over event log files. When enabled, each event log file is cut down to spark.eventLog.rolling.maxFileSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#rollingmaxfilesize","title":"rolling.maxFileSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.eventLog.rolling.maxFileSize

Maximum size of an event log file before it is rolled over (with spark.eventLog.rolling.enabled enabled)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: 128m

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Must be at least 10 MiB
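A hedged sketch of enabling rolling together with a custom maximum file size (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Roll the event log into multiple files once each reaches ~256 MiB
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.rolling.enabled", "true")
  .set("spark.eventLog.rolling.maxFileSize", "256m")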

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#sparkhistory","title":"spark.history","text":""},{"location":"history-server/configuration-properties/#fslogdirectory","title":"fs.logDirectory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.fs.logDirectory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The directory for event log files. The directory has to exist before starting History Server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: file:/tmp/spark-events

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#kerberosenabled","title":"kerberos.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.kerberos.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Whether to enable (true) or disable (false) security when working with HDFS with security enabled (Kerberos).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#kerberoskeytab","title":"kerberos.keytab

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.kerberos.keytab

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Keytab to use for login to Kerberos. Required when spark.history.kerberos.enabled is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: (empty)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#kerberosprincipal","title":"kerberos.principal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.kerberos.principal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Kerberos principal. Required when spark.history.kerberos.enabled is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: (empty)
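
For illustration, a sketch of the three Kerberos-related properties set together (the keytab path and principal are hypothetical); in practice they are usually given to the History Server via its Spark configuration, e.g. spark-defaults.conf:

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.history.kerberos.enabled\", \"true\")\n  .set(\"spark.history.kerberos.keytab\", \"/etc/security/keytabs/spark.keytab\")  // hypothetical path\n  .set(\"spark.history.kerberos.principal\", \"spark@EXAMPLE.COM\")                // hypothetical principal\n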

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#provider","title":"provider

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.provider

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Fully-qualified class name of an ApplicationHistoryProvider for HistoryServer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: org.apache.spark.deploy.history.FsHistoryProvider

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#retainedapplications","title":"retainedApplications

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.retainedApplications

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              How many Spark applications HistoryServer should retain

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: 50

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#storepath","title":"store.path

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.store.path

Local directory where to cache application history information

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: (undefined) (i.e. all history information will be kept in memory)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#uimaxapplications","title":"ui.maxApplications

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.ui.maxApplications

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              How many Spark applications HistoryServer should show in the UI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: (unbounded)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"history-server/configuration-properties/#uiport","title":"ui.port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              spark.history.ui.port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The port of History Server's web UI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: 18080
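
As a rough sketch (all values are hypothetical), several of the properties above combined in one Spark configuration that a History Server could be started with:

import org.apache.spark.SparkConf\n\nval conf = new SparkConf()\n  .set(\"spark.history.fs.logDirectory\", \"hdfs://namenode/shared/spark-logs\")  // hypothetical shared directory\n  .set(\"spark.history.retainedApplications\", \"100\")\n  .set(\"spark.history.store.path\", \"/var/spark/history-cache\")                // hypothetical local cache\n  .set(\"spark.history.ui.port\", \"18080\")\n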

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"local/","title":"Spark local","text":"

Spark local is one of the available runtime environments in Apache Spark. It is the only runtime environment that does not require a proper cluster manager (and hence many call it a pseudo-cluster; note, however, that such a concept does exist in Spark and is slightly different).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark local is used for the following master URLs (as specified using <<../SparkConf.md#, SparkConf.setMaster>> method or <<../configuration-properties.md#spark.master, spark.master>> configuration property):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local (with exactly 1 CPU core)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local[n] (with exactly n CPU cores)

• local[*] (with as many CPU cores as are available on the local machine)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local[n, m] (with exactly n CPU cores and m retries when a task fails)

• local[*, m] (with as many CPU cores as are available on the local machine and m retries when a task fails)
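
A minimal sketch of requesting one of these master URLs with SparkConf.setMaster (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}\n\nval conf = new SparkConf()\n  .setAppName(\"local-demo\")  // arbitrary application name\n  .setMaster(\"local[*]\")     // use all available CPU cores\nval sc = SparkContext.getOrCreate(conf)\n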

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Internally, Spark local uses <> as the <<../SchedulerBackend.md#, SchedulerBackend>> and executor:ExecutorBackend.md[].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In this non-distributed multi-threaded runtime environment, Spark spawns all the main execution components - the spark-driver.md[driver] and an executor:Executor.md[] - in the same single JVM.

The default parallelism is the number of threads as specified in the <>. This is the only mode where the driver JVM also performs execution (it hosts both the driver and the only executor).

The local mode is very convenient for testing, debugging, and demonstration purposes as it requires no prior setup to launch Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              This mode of operation is also called http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark[Spark in-process] or (less commonly) a local version of Spark.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkContext.isLocal returns true when Spark runs in local mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> sc.isLocal\nres0: Boolean = true\n

Spark shell defaults to local mode with local[*] as the master URL.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> sc.master\nres0: String = local[*]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Tasks are not re-executed on failure in local mode (unless <> is used).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The scheduler:TaskScheduler.md[task scheduler] in local mode works with local/spark-LocalSchedulerBackend.md[LocalSchedulerBackend] task scheduler backend.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"local/#master-url","title":"Master URL","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The URL says how many threads can be used in total:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local uses 1 thread only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local[n] uses n threads.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#availableProcessors--[Runtime.getRuntime.availableProcessors()] to know the number).

NOTE: What happens when there are fewer cores than n in the local[n] master URL? It \"breaks\" scheduling as Spark assumes more CPU cores are available to execute tasks than there actually are.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[local-with-retries]] local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of <<../configuration-properties.md#spark.task.maxFailures, spark.task.maxFailures>> configuration property.
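
A short sketch of the local-with-retries variant (4 threads and 3 retries are arbitrary values):

import org.apache.spark.SparkConf\n\nval conf = new SparkConf().setMaster(\"local[4,3]\")  // 4 threads and 3 retries when a task fails\n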

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[task-submission]] Task Submission a.k.a. reviveOffers

.TaskSchedulerImpl.submitTasks in local mode
image::taskscheduler-submitTasks-local-mode.png[align=\"center\"]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              When ReviveOffers or StatusUpdate messages are received, local/spark-LocalEndpoint.md[LocalEndpoint] places an offer to TaskSchedulerImpl (using TaskSchedulerImpl.resourceOffers).

If one or more tasks match the offer, they are launched (using the executor.launchTask method).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The number of tasks to be launched is controlled by the number of threads as specified in <>. The executor uses threads to spawn the tasks."},{"location":"local/LauncherBackend/","title":"LauncherBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[LauncherBackend]] LauncherBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LauncherBackend is the <> of <> that can <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [[contract]] .LauncherBackend Contract (Abstract Methods Only) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Method | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | conf a| [[conf]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"local/LauncherBackend/#source-scala","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#conf-sparkconf","title":"conf: SparkConf","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkConf.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used exclusively when LauncherBackend is requested to <> (to access configuration-properties.md#spark.launcher.port[spark.launcher.port] and configuration-properties.md#spark.launcher.secret[spark.launcher.secret] configuration properties)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | onStopRequest a| [[onStopRequest]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"local/LauncherBackend/#source-scala_1","title":"[source, scala]","text":""},{"location":"local/LauncherBackend/#onstoprequest-unit","title":"onStopRequest(): Unit","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Handles stop requests (to stop the Spark application as gracefully as possible)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used exclusively when LauncherBackend is requested to <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [[creating-instance]] LauncherBackend takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: LauncherBackend is a Scala abstract class and cannot be <> directly. It is created indirectly for the <>.
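
For illustration only, a hypothetical sketch of the shape of such an anonymous subclass (LauncherBackend is private[spark], so this is only visible to Spark's own scheduler backends; sparkConf and stop() are placeholder names):

// hypothetical sketch: how a scheduler backend could satisfy the contract\nval launcherBackend = new LauncherBackend() {\n  override def conf: SparkConf = sparkConf     // the backend's SparkConf\n  override def onStopRequest(): Unit = stop()  // stop the application as gracefully as possible\n}\n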

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [[internal-registries]] .LauncherBackend's Internal Properties (e.g. Registries, Counters and Flags) [cols=\"1m,3\",options=\"header\",width=\"100%\"] |=== | Name | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | _isConnected a| [[_isConnected]][[isConnected]] Flag that says whether...FIXME (true) or not (false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | clientThread a| [[clientThread]] Java's https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html[java.lang.Thread]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | connection a| [[connection]] BackendConnection

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | lastState a| [[lastState]] SparkAppHandle.State

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [[implementations]] LauncherBackend is <> (as an anonymous class) for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Spark on YARN's <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Spark local's <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Spark on Mesos' <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Spark Standalone's <>

=== [[close]] Closing -- close Method

[source, scala]
----
close(): Unit
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                close...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: close is used when...FIXME

=== [[connect]] Connecting -- connect Method

[source, scala]
----
connect(): Unit
----

connect...FIXME

[NOTE]
====

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                connect is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Spark Standalone's StandaloneSchedulerBackend is requested to <> (in client deploy mode)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Spark local's LocalSchedulerBackend is <>

• Spark on Mesos' MesosCoarseGrainedSchedulerBackend is requested to <> (in client deploy mode)

• Spark on YARN's Client is requested to <>
====

=== [[fireStopRequest]] fireStopRequest Internal Method

[source, scala]
----
fireStopRequest(): Unit
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  fireStopRequest...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: fireStopRequest is used exclusively when BackendConnection is requested to handle a Stop message.

=== [[onDisconnected]] Handling Disconnects From Scheduling Backend -- onDisconnected Method

[source, scala]
----
onDisconnected(): Unit
----

onDisconnected does nothing by default and is expected to be overridden by the <<implementations, implementations>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: onDisconnected is used when...FIXME

=== [[setAppId]] setAppId Method

[source, scala]
----
setAppId(appId: String): Unit
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  setAppId...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: setAppId is used when...FIXME

=== [[setState]] setState Method

[source, scala]
----
setState(state: SparkAppHandle.State): Unit
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  setState...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: setState is used when...FIXME
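Although the walkthroughs above are still marked FIXME, the purpose of setAppId and setState is to report the application ID and lifecycle state back to the launcher process. Below is a hedged sketch of how a backend typically drives them; SparkAppHandle.State is Spark's public enum from the spark-launcher module, while LauncherReporter and the demo object are illustrative assumptions only.

[source, scala]
----
import org.apache.spark.launcher.SparkAppHandle

// Illustrative stand-in for the reporting half of LauncherBackend.
trait LauncherReporter {
  def setAppId(appId: String): Unit
  def setState(state: SparkAppHandle.State): Unit
}

object LauncherReporterDemo extends App {
  val reporter = new LauncherReporter {
    def setAppId(appId: String): Unit = println(s"launcher <- appId=$appId")
    def setState(state: SparkAppHandle.State): Unit = println(s"launcher <- state=$state")
  }

  // A typical lifecycle as seen from a scheduler backend:
  reporter.setState(SparkAppHandle.State.SUBMITTED)
  reporter.setAppId("app-20240217185125-0000")
  reporter.setState(SparkAppHandle.State.RUNNING)
  // ...and on shutdown:
  reporter.setState(SparkAppHandle.State.FINISHED)
}
----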

= LocalEndpoint

LocalEndpoint is the ThreadSafeRpcEndpoint for LocalSchedulerBackend and is registered under the LocalSchedulerBackendEndpoint name.

== Review Me

LocalEndpoint is created exclusively when LocalSchedulerBackend is requested to <>.

Put simply, LocalEndpoint is the communication channel between the <<executorBackend, LocalSchedulerBackend>> and the <<executor, Executor>>. LocalEndpoint is a (thread-safe) rpc:RpcEndpoint.md[RpcEndpoint] that hosts an <<executor, Executor>> (with driver ID and localhost hostname) for Spark local mode.

[[messages]]
.LocalEndpoint's RPC Messages
[cols="1,3",options="header",width="100%"]
|===
| Message
| Description

| <<KillTask, KillTask>>
| Requests the <<executor, Executor>> to executor:Executor.md#killTask[kill a given task]

| <<ReviveOffers, ReviveOffers>>
| Calls <<reviveOffers, reviveOffers>>

| <<StopExecutor, StopExecutor>>
| Requests the <<executor, Executor>> to executor:Executor.md#stop[stop]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When a LocalEndpoint starts up (as part of Spark local's initialization) it prints out the following INFO messages to the logs:

----
INFO Executor: Starting executor ID driver on host localhost
INFO Executor: Using REPL class URI: http://192.168.1.4:56131
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[executor]] LocalEndpoint creates a single executor:Executor.md[] with the following properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[localExecutorId]] driver ID for the executor:Executor.md#executorId[executor ID]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[localExecutorHostname]] localhost for the executor:Executor.md#executorHostname[hostname]

• <<userClassPath, userClassPath>> for the executor:Executor.md#userClassPath[user-defined CLASSPATH]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • executor:Executor.md#isLocal[isLocal] flag enabled

The <<executor, Executor>> is then used when LocalEndpoint is requested to handle <<KillTask, KillTask>> and <<StopExecutor, StopExecutor>> RPC messages, and to <<reviveOffers, revive offers>>.

[[internal-registries]]
.LocalEndpoint's Internal Properties (e.g. Registries, Counters and Flags)
[cols="1m,3",options="header",width="100%"]
|===
| Name
| Description

| freeCores
a| [[freeCores]] The number of CPU cores that are free to use (to schedule tasks). See the sketch after this table.

Default: the initial <<totalCores, number of CPU cores>> (aka totalCores)

Increments when LocalEndpoint is requested to handle a <<StatusUpdate, StatusUpdate>> RPC message with a finished state

Decrements when LocalEndpoint is requested to <<reviveOffers, revive offers>> and there are tasks to execute

NOTE: A single task to execute costs scheduler:TaskSchedulerImpl.md#CPUS_PER_TASK[spark.task.cpus] CPU cores (default: 1).

Used when LocalEndpoint is requested to <<reviveOffers, revive offers>>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |===
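The freeCores accounting is plain bookkeeping around spark.task.cpus. The following is a self-contained toy model of the idea; the object and helper names are illustrative, not Spark code.

[source, scala]
----
// Illustrative model of the freeCores bookkeeping described in the table above
// (not Spark's private LocalEndpoint).
object FreeCoresDemo extends App {
  val totalCores  = 4          // initial value of freeCores
  val cpusPerTask = 1          // spark.task.cpus (default: 1)
  var freeCores   = totalCores

  // Scheduling a task (reviveOffers with tasks to execute) takes cores away...
  def launchTask(): Unit = freeCores -= cpusPerTask

  // ...and a StatusUpdate with a finished state (FINISHED, FAILED, KILLED, LOST)
  // gives them back.
  def taskFinished(): Unit = freeCores += cpusPerTask

  launchTask(); launchTask()
  println(s"free cores after launching 2 tasks: $freeCores")   // 2
  taskFinished()
  println(s"free cores after one task finished: $freeCores")   // 3
}
----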

[[logging]]
[TIP]
====
Enable INFO logging level for org.apache.spark.scheduler.local.LocalEndpoint logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.local.LocalEndpoint=INFO

Refer to <<../spark-logging.md#, Logging>>.
====

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === [[creating-instance]] Creating LocalEndpoint Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    LocalEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[rpcEnv]] <<../index.md#, RpcEnv>>
• [[userClassPath]] User-defined class path (Seq[URL]) that is the <> configuration property and is used exclusively to create the <<executor, Executor>>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[scheduler]] scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl]
• [[executorBackend]] LocalSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[totalCores]] Number of CPU cores (aka totalCores)
LocalEndpoint initializes the <<internal-registries, internal registries and counters>>.

=== [[receive]] Processing Receive-Only RPC Messages -- receive Method

[source, scala]
----
receive: PartialFunction[Any, Unit]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: receive is part of the rpc:RpcEndpoint.md#receive[RpcEndpoint] abstraction.

receive handles (processes) <<ReviveOffers, ReviveOffers>>, <<StatusUpdate, StatusUpdate>>, and <<KillTask, KillTask>> RPC messages (see the sketch below).
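A minimal, self-contained sketch of that dispatch follows; it models the messages, the scheduler and the executor with plain Scala stand-ins (TaskState is reduced to a String here), so every name is illustrative rather than Spark's private API.

[source, scala]
----
import java.nio.ByteBuffer

// Illustrative message protocol (mirrors the RPC messages described below).
case object ReviveOffers
case class StatusUpdate(taskId: Long, state: String, serializedData: ByteBuffer)
case class KillTask(taskId: Long, interruptThread: Boolean, reason: String)

object LocalEndpointSketch {
  private val finishedStates = Set("FINISHED", "FAILED", "KILLED", "LOST")
  private val cpusPerTask = 1   // spark.task.cpus
  private var freeCores = 4

  // Stand-ins for the TaskSchedulerImpl and the single Executor.
  private def statusUpdate(taskId: Long, state: String, data: ByteBuffer): Unit = ()
  private def killTask(taskId: Long, interrupt: Boolean, reason: String): Unit = ()
  private def reviveOffers(): Unit = ()

  // The receive-only dispatch: ReviveOffers, StatusUpdate and KillTask.
  val receive: PartialFunction[Any, Unit] = {
    case ReviveOffers =>
      reviveOffers()
    case StatusUpdate(taskId, state, data) =>
      statusUpdate(taskId, state, data)
      if (finishedStates(state)) {
        freeCores += cpusPerTask   // give the cores back...
        reviveOffers()             // ...and try to schedule more tasks
      }
    case KillTask(taskId, interrupt, reason) =>
      killTask(taskId, interrupt, reason)
  }
}
----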

==== [[ReviveOffers]] ReviveOffers RPC Message

[source, scala]
----
ReviveOffers()
----

When received, LocalEndpoint <<reviveOffers, revives offers>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: ReviveOffers RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

==== [[StatusUpdate]] StatusUpdate RPC Message

[source, scala]
----
StatusUpdate(
  taskId: Long,
  state: TaskState,
  serializedData: ByteBuffer)
----

When received, LocalEndpoint requests the <<scheduler, TaskSchedulerImpl>> to scheduler:TaskSchedulerImpl.md#statusUpdate[handle a task status update] (given the taskId, the task state and the data).

If the given scheduler:Task.md#TaskState[TaskState] is a finished state (one of FINISHED, FAILED, KILLED, LOST), LocalEndpoint adds scheduler:TaskSchedulerImpl.md#CPUS_PER_TASK[spark.task.cpus] (default: 1) back to the <<freeCores, freeCores>> registry and then <<reviveOffers, revives offers>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: StatusUpdate RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

==== [[KillTask]] KillTask RPC Message

[source, scala]
----
KillTask(
  taskId: Long,
  interruptThread: Boolean,
  reason: String)
----

When received, LocalEndpoint requests the single <<executor, Executor>> to executor:Executor.md#killTask[kill a task] (given the taskId, the interruptThread flag and the reason).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: KillTask RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

=== [[reviveOffers]] Reviving Offers -- reviveOffers Method

[source, scala]
----
reviveOffers(): Unit
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      reviveOffers...FIXME

NOTE: reviveOffers is used when LocalEndpoint is requested to <<receive, handle RPC messages>> (namely <<ReviveOffers, ReviveOffers>> and <<StatusUpdate, StatusUpdate>>).
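Although the walkthrough above is still marked FIXME, the gist of reviveOffers is to offer the single executor's free cores to the TaskSchedulerImpl and launch every task it gets back, paying spark.task.cpus per task. Below is a hedged, self-contained sketch of that loop; WorkerOffer, TaskDescription and the scheduler here are simplified stand-ins, not Spark's classes.

[source, scala]
----
// Simplified stand-ins for WorkerOffer, TaskDescription, TaskSchedulerImpl and Executor.
case class WorkerOffer(executorId: String, host: String, cores: Int)
case class TaskDescription(taskId: Long, name: String)

class ReviveOffersSketch(totalCores: Int, cpusPerTask: Int = 1) {
  private var freeCores = totalCores

  // Pretend scheduler: hand out as many dummy tasks as the offered cores allow.
  private def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] =
    offers.map(o => (0 until (o.cores / cpusPerTask)).map(i => TaskDescription(i, s"task-$i")))

  private def launchTask(task: TaskDescription): Unit =
    println(s"launching ${task.name} on executor 'driver'")

  def reviveOffers(): Unit = {
    // Offer all currently free cores of the single local executor ("driver" on localhost)...
    val offers = Seq(WorkerOffer("driver", "localhost", freeCores))
    // ...and launch every task the scheduler assigns, paying spark.task.cpus per task.
    for (task <- resourceOffers(offers).flatten) {
      freeCores -= cpusPerTask
      launchTask(task)
    }
  }
}

object ReviveOffersDemo extends App {
  new ReviveOffersSketch(totalCores = 2).reviveOffers()
}
----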

=== [[receiveAndReply]] Processing Receive-Reply RPC Messages -- receiveAndReply Method

[source, scala]
----
receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: receiveAndReply is part of the rpc:RpcEndpoint.md#receiveAndReply[RpcEndpoint] abstraction.

receiveAndReply handles (processes) the <<StopExecutor, StopExecutor>> RPC message exclusively.

==== [[StopExecutor]] StopExecutor RPC Message

[source, scala]
----
StopExecutor()
----

When a StopExecutor message arrives, LocalEndpoint requests the single Executor to executor:Executor.md#stop[stop] and requests the given RpcCallContext to reply with true (as the response).
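For illustration, a minimal sketch of such a handler (not Spark's actual source; `RpcCallContext`, `StopExecutor` and `executor` stand for the endpoint's RPC context type, the stop message and the single local Executor, respectively):

```scala
// Schematic receiveAndReply partial function handling StopExecutor,
// based on the description above (assumed names, not the real implementation).
def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case StopExecutor =>
    executor.stop()      // stop the single local Executor
    context.reply(true)  // reply true as the response
}
```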

NOTE: StopExecutor RPC message is sent out exclusively when LocalSchedulerBackend is requested to stop."},{"location":"local/LocalSchedulerBackend/","title":"LocalSchedulerBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      LocalSchedulerBackend is a SchedulerBackend and an ExecutorBackend for Spark local deployment.

| Master URL | Total CPU Cores |
|------------|-----------------|
| local | 1 |
| local[n] | n |
| local[*] | The number of available CPU cores on the local machine |
| local[n, m] | n CPU cores and m task retries |
| local[*, m] | The number of available CPU cores on the local machine and m task retries |
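For example, a local master URL is given when building the SparkConf of a Spark application (a minimal sketch; the application name and core count are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 4 CPU cores (any of the master URLs above would work here).
val conf = new SparkConf()
  .setAppName("local-demo") // arbitrary example name
  .setMaster("local[4]")
val sc = new SparkContext(conf)
```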

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"local/LocalSchedulerBackend/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      LocalSchedulerBackend takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Total number of CPU cores

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        LocalSchedulerBackend is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is requested to create a Spark Scheduler (for local master URL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KubernetesClusterManager (Spark on Kubernetes) is requested for a SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"local/LocalSchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks","text":"SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxNumConcurrentTasks is part of the SchedulerBackend abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxNumConcurrentTasks calculates the number of CPU cores per task for the given ResourceProfile (and this SparkConf).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, maxNumConcurrentTasks is the total CPU cores available divided by the number of CPU cores per task.
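For illustration, with a hypothetical local[8] master and spark.task.cpus set to 2:

```scala
// maxNumConcurrentTasks = total CPU cores / CPU cores per task
val totalCores = 8      // e.g. local[8]
val cpusPerTask = 2     // e.g. spark.task.cpus=2
val maxConcurrentTasks = totalCores / cpusPerTask  // 4
```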

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"local/LocalSchedulerBackend/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.scheduler.local.LocalSchedulerBackend logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        logger.LocalSchedulerBackend.name = org.apache.spark.scheduler.local.LocalSchedulerBackend\nlogger.LocalSchedulerBackend.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/","title":"Memory System","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Memory System is a core component of Apache Spark that is based on UnifiedMemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SPARK-10000: Consolidate storage and execution memory management
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/#videos","title":"Videos","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Deep Dive: Apache Spark Memory Management
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Deep Dive into Project Tungsten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark Performance: What's Next
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/ExecutionMemoryPool/","title":"ExecutionMemoryPool","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExecutionMemoryPool is a MemoryPool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/ExecutionMemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExecutionMemoryPool takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Lock Object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryMode (ON_HEAP or OFF_HEAP)

ExecutionMemoryPool is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MemoryManager is created (and initializes on-heap and off-heap execution memory pools)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"memory/ExecutionMemoryPool/#acquiring-memory","title":"Acquiring Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => (),\n  computeMaxPoolSize: () => Long = () => poolSize): Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireMemory...FIXME
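Although the details are left as a FIXME above, acquireMemory is meant to keep every active task's share between 1/(2N) and 1/N of the pool, where N is the number of active tasks. A rough sketch of those bounds (an assumed helper, not the actual implementation):

```scala
// Illustrative only: per-task share bounds for an execution memory pool.
def fairShareBounds(poolSize: Long, numActiveTasks: Int): (Long, Long) = {
  val maxPerTask = poolSize / numActiveTasks        // at most 1/N of the pool
  val minPerTask = poolSize / (2 * numActiveTasks)  // each task may ramp up to 1/(2N) before spilling
  (minPerTask, maxPerTask)
}
```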

acquireMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnifiedMemoryManager is requested to acquire execution memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/MemoryAllocator/","title":"MemoryAllocator","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          MemoryAllocator is an abstraction of memory allocators that TaskMemoryManager uses to allocate and release memory.

MemoryAllocator defines two built-in allocators that are available as HEAP (HeapMemoryAllocator) and UNSAFE (UnsafeMemoryAllocator).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          A MemoryAllocator to use is selected when MemoryManager is created (based on MemoryMode).
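A minimal usage sketch of the two built-in allocators (Spark's internals, e.g. TaskMemoryManager, are the intended callers; the size is arbitrary):

```scala
import org.apache.spark.unsafe.memory.MemoryAllocator

// Allocate and release a 1 KiB block with the on-heap allocator;
// MemoryAllocator.UNSAFE does the same off-heap.
val block = MemoryAllocator.HEAP.allocate(1024)
MemoryAllocator.HEAP.free(block)
```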

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"memory/MemoryAllocator/#contract","title":"Contract","text":""},{"location":"memory/MemoryAllocator/#allocating-contiguous-block-of-memory","title":"Allocating Contiguous Block of Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          MemoryBlock allocate(\n  long size)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskMemoryManager is requested to allocate a memory page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/MemoryAllocator/#releasing-memory","title":"Releasing Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          void free(\n  MemoryBlock memory)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskMemoryManager is requested to release a memory page and clean up all the allocated memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/MemoryAllocator/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • HeapMemoryAllocator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeMemoryAllocator"},{"location":"memory/MemoryConsumer/","title":"MemoryConsumer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryConsumer is an abstraction of memory consumers (of TaskMemoryManager) that support spilling.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryConsumers correspond to individual operators and data structures within a task. TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to consumers in order to trigger spilling when running low on memory.

A MemoryConsumer keeps track of how much memory it has allocated (so that it can be asked to spill and release it when execution memory runs low).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"memory/MemoryConsumer/#contract","title":"Contract","text":""},{"location":"memory/MemoryConsumer/#spilling","title":"Spilling
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            void spill() // (1)\nlong spill(\n  long size,\n  MemoryConsumer trigger)\n
1. Uses Long.MAX_VALUE as the size and this MemoryConsumer as the trigger

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskMemoryManager is requested to acquire execution memory (and trySpillAndAcquire)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleExternalSorter is requested to growPointerArrayIfNecessary, insertRecord
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeExternalSorter is requested to createWithExistingInMemorySorter, growPointerArrayIfNecessary, insertRecord, merge
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"memory/MemoryConsumer/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BytesToBytesMap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spillable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • a few others
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"memory/MemoryConsumer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryConsumer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Page Size
• MemoryMode (ON_HEAP or OFF_HEAP)

Abstract Class

MemoryConsumer is an abstract class and cannot be created directly. It is created indirectly as one of the concrete MemoryConsumers.
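A hypothetical minimal subclass, shown only to illustrate the constructor parameters and the spill contract (real consumers such as BytesToBytesMap free their internal data structures in spill):

```scala
import org.apache.spark.memory.{MemoryConsumer, MemoryMode, TaskMemoryManager}

// Hypothetical consumer with nothing to spill (always reports 0 bytes released).
class NoopConsumer(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), MemoryMode.ON_HEAP) {

  override def spill(size: Long, trigger: MemoryConsumer): Long = 0L
}
```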

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"memory/MemoryManager/","title":"MemoryManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MemoryManager is an abstraction of memory managers that can share available memory between tasks (TaskMemoryManager) and storage (BlockManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MemoryManager splits assigned memory into two regions:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Execution Memory for shuffles, joins, sorts and aggregations

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Storage Memory for caching and propagating internal data across Spark nodes (in on- and off-heap modes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MemoryManager is used to create BlockManager (and MemoryStore) and TaskMemoryManager.
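With the default UnifiedMemoryManager, the size of the two regions is governed by configuration properties (a sketch using the standard property names with their default values):

```scala
import org.apache.spark.SparkConf

// spark.memory.fraction: share of (heap - reserved memory) used for execution + storage
// spark.memory.storageFraction: share of that region protected from eviction by execution
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")
```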

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"memory/MemoryManager/#contract","title":"Contract","text":""},{"location":"memory/MemoryManager/#acquiring-execution-memory-for-task","title":"Acquiring Execution Memory for Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              acquireExecutionMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  memoryMode: MemoryMode): Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskMemoryManager is requested to acquire execution memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"memory/MemoryManager/#acquiring-storage-memory-for-block","title":"Acquiring Storage Memory for Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              acquireStorageMemory(\n  blockId: BlockId,\n  numBytes: Long,\n  memoryMode: MemoryMode): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

• MemoryStore is requested to putBytes and putIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"memory/MemoryManager/#acquiring-unroll-memory-for-block","title":"Acquiring Unroll Memory for Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              acquireUnrollMemory(\n  blockId: BlockId,\n  numBytes: Long,\n  memoryMode: MemoryMode): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MemoryStore is requested for the reserveUnrollMemoryForThisTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"memory/MemoryManager/#total-available-off-heap-storage-memory","title":"Total Available Off-Heap Storage Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              maxOffHeapStorageMemory: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              May vary over time

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MemoryStore is requested for the maxMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"memory/MemoryManager/#total-available-on-heap-storage-memory","title":"Total Available On-Heap Storage Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              maxOnHeapStorageMemory: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              May vary over time

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MemoryStore is requested for the maxMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"memory/MemoryManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • UnifiedMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"memory/MemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MemoryManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Number of CPU Cores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Size of the On-Heap Storage Memory
• Size of the On-Heap Execution Memory

Abstract Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryManager\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete MemoryManagers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"memory/MemoryManager/#SparkEnv","title":"Accessing MemoryManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryManager is available as SparkEnv.memoryManager on the driver and executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                import org.apache.spark.SparkEnv\nval mm = SparkEnv.get.memoryManager\n
// MemoryManager is private[spark]\n// the following won't work unless within the org.apache.spark package\n// import org.apache.spark.memory.MemoryManager\n// assert(mm.isInstanceOf[MemoryManager])\n\n// we have to resort to string comparison \ud83d\ude14\nassert(\"UnifiedMemoryManager\".equals(mm.getClass.getSimpleName))\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"memory/MemoryManager/#associating-memorystore-with-storage-memory-pools","title":"Associating MemoryStore with Storage Memory Pools
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                setMemoryStore(\n  store: MemoryStore): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                setMemoryStore requests the on-heap and off-heap storage memory pools to use the given MemoryStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                setMemoryStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is created
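
For illustration, here is a self-contained sketch of that wiring; MemoryStoreStub, StorageMemoryPoolStub and MemoryManagerSketch are invented stand-ins (the real types are private[spark]). The point is simply that both storage pools end up with the same MemoryStore.

// A toy sketch of setMemoryStore: the same store is handed to both storage pools.
trait MemoryStoreStub

class StorageMemoryPoolStub {
  private var store: Option[MemoryStoreStub] = None
  def setMemoryStore(s: MemoryStoreStub): Unit = { store = Some(s) }
  def hasStore: Boolean = store.isDefined
}

class MemoryManagerSketch {
  val onHeapStorageMemoryPool  = new StorageMemoryPoolStub
  val offHeapStorageMemoryPool = new StorageMemoryPoolStub

  def setMemoryStore(store: MemoryStoreStub): Unit = {
    onHeapStorageMemoryPool.setMemoryStore(store)
    offHeapStorageMemoryPool.setMemoryStore(store)
  }
}

val mm = new MemoryManagerSketch
mm.setMemoryStore(new MemoryStoreStub {})
assert(mm.onHeapStorageMemoryPool.hasStore && mm.offHeapStorageMemoryPool.hasStore)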
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#execution-memory-pools","title":"Execution Memory Pools","text":""},{"location":"memory/MemoryManager/#on-heap","title":"On-Heap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapExecutionMemoryPool: ExecutionMemoryPool\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryManager creates an ExecutionMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapExecutionMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#off-heap","title":"Off-Heap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                offHeapExecutionMemoryPool: ExecutionMemoryPool\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryManager creates an ExecutionMemoryPool for OFF_HEAP memory mode when created and immediately requests it to incrementPoolSize to...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#storage-memory-pools","title":"Storage Memory Pools","text":""},{"location":"memory/MemoryManager/#on-heap_1","title":"On-Heap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapStorageMemoryPool: StorageMemoryPool\n

MemoryManager creates a StorageMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapStorageMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapStorageMemoryPool is requested to setMemoryStore when MemoryManager is requested to setMemoryStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapStorageMemoryPool is requested to release memory when MemoryManager is requested to release on-heap storage memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapStorageMemoryPool is requested to release all memory when MemoryManager is requested to release all storage memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onHeapStorageMemoryPool is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryManager is requested for the storageMemoryUsed and onHeapStorageMemoryUsed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • UnifiedMemoryManager is requested to acquire on-heap execution and storage memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#off-heap_1","title":"Off-Heap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                offHeapStorageMemoryPool: StorageMemoryPool\n

MemoryManager creates a StorageMemoryPool for OFF_HEAP memory mode when created and immediately requests it to incrementPoolSize to offHeapStorageMemory.

offHeapStorageMemoryPool is requested to setMemoryStore when MemoryManager is requested to setMemoryStore.

offHeapStorageMemoryPool is requested to release memory when MemoryManager is requested to release off-heap storage memory.

offHeapStorageMemoryPool is requested to release all memory when MemoryManager is requested to release all storage memory.

offHeapStorageMemoryPool is requested for the memoryUsed when MemoryManager is requested for the storageMemoryUsed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                offHeapStorageMemoryPool is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryManager is requested for the offHeapStorageMemoryUsed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • UnifiedMemoryManager is requested to acquire off-heap execution and storage memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#total-storage-memory-used","title":"Total Storage Memory Used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                storageMemoryUsed: Long\n

storageMemoryUsed is the sum of the memory used by the on-heap and off-heap storage memory pools (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                storageMemoryUsed\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskMemoryManager is requested to showMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryStore is requested to memoryUsed
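
A tiny, self-contained sketch of the aggregation (the pool names and numbers are made up):

// Toy model: the total is simply the sum of the two storage pools' memoryUsed.
final case class StoragePoolsUsage(onHeapUsed: Long, offHeapUsed: Long) {
  def storageMemoryUsed: Long = onHeapUsed + offHeapUsed
}

assert(StoragePoolsUsage(onHeapUsed = 300, offHeapUsed = 200).storageMemoryUsed == 500)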
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#memorymode","title":"MemoryMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryMode: MemoryMode\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryMode tracks whether Tungsten memory will be allocated on the JVM heap or off-heap (using sun.misc.Unsafe).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                final val

tungstenMemoryMode is a final value and so is initialized only once, when MemoryManager is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryMode is OFF_HEAP when the following are all met:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • spark.memory.offHeap.enabled configuration property is enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • spark.memory.offHeap.size configuration property is greater than 0

• JVM supports unaligned memory access (aka unaligned Unsafe, i.e. the sun.misc.Unsafe package is available and the underlying system has unaligned-access capability)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, tungstenMemoryMode is ON_HEAP.
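
As a rough, hedged sketch of that decision using public classes only (the real logic runs in MemoryManager's constructor; Platform.unaligned() from Spark's unsafe module stands in for the unaligned-access check):

import org.apache.spark.SparkConf
import org.apache.spark.memory.MemoryMode
import org.apache.spark.unsafe.Platform

// Sketch of the decision described above (assumes spark-core and spark-unsafe on the classpath).
def tungstenMemoryModeOf(conf: SparkConf): MemoryMode = {
  val offHeapEnabled = conf.getBoolean("spark.memory.offHeap.enabled", false)
  val offHeapSize    = conf.getSizeAsBytes("spark.memory.offHeap.size", "0")
  if (offHeapEnabled && offHeapSize > 0 && Platform.unaligned()) MemoryMode.OFF_HEAP
  else MemoryMode.ON_HEAP
}

// With the defaults, Tungsten memory stays on the JVM heap.
assert(tungstenMemoryModeOf(new SparkConf(false)) == MemoryMode.ON_HEAP)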

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Given that spark.memory.offHeap.enabled configuration property is turned off by default and spark.memory.offHeap.size configuration property is 0 by default, Apache Spark seems to encourage using Tungsten memory allocated on the JVM heap (ON_HEAP).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryMode is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryManager is created (and initializes the pageSizeBytes and tungstenMemoryAllocator internal properties)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskMemoryManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#memoryallocator","title":"MemoryAllocator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryAllocator: MemoryAllocator\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryManager selects the MemoryAllocator to use based on the MemoryMode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                final val

tungstenMemoryAllocator is a final value and so is initialized only once, when MemoryManager is created.

MemoryMode → MemoryAllocator
• ON_HEAP → HeapMemoryAllocator
• OFF_HEAP → UnsafeMemoryAllocator

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tungstenMemoryAllocator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskMemoryManager is requested to allocate a memory page, release a memory page and clean up all the allocated memory
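
A short sketch of the selection: it mirrors the mapping above and uses the ready-made allocator singletons shipped in Spark's unsafe module.

import org.apache.spark.memory.MemoryMode
import org.apache.spark.unsafe.memory.MemoryAllocator

// MemoryAllocator.HEAP and MemoryAllocator.UNSAFE are the two allocator instances
// provided by the org.apache.spark.unsafe.memory package.
def allocatorFor(mode: MemoryMode): MemoryAllocator = mode match {
  case MemoryMode.ON_HEAP  => MemoryAllocator.HEAP
  case MemoryMode.OFF_HEAP => MemoryAllocator.UNSAFE
}

assert(allocatorFor(MemoryMode.ON_HEAP).getClass.getSimpleName == "HeapMemoryAllocator")
assert(allocatorFor(MemoryMode.OFF_HEAP).getClass.getSimpleName == "UnsafeMemoryAllocator")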
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#pageSizeBytes","title":"Page Size

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                pageSizeBytes is either spark.buffer.pageSize, if defined, or the default page size.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                pageSizeBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskMemoryManager is requested for the page size
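
As an example, the property can be set explicitly on a SparkConf so pageSizeBytes no longer falls back to the default page size (the 2m value below is arbitrary):

import org.apache.spark.SparkConf

// spark.buffer.pageSize accepts a byte-size string and is unset by default.
val conf = new SparkConf(false).set("spark.buffer.pageSize", "2m")
assert(conf.getSizeAsBytes("spark.buffer.pageSize") == 2L * 1024 * 1024)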
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryManager/#defaultPageSizeBytes","title":"Default Page Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                defaultPageSizeBytes: Long\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Lazy Value

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                defaultPageSizeBytes is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Learn more in the Scala Language Specification.
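
The following is a minimal, self-contained illustration of lazy-val semantics (the object and value names are made up for the example and are not part of Spark):

object LazyValDemo extends App {
  // The initializer runs only on the first access; the result is cached afterwards.
  lazy val defaultPageSize: Long = {
    println("computing default page size...") // printed exactly once
    1024L * 1024
  }

  println(defaultPageSize) // triggers the computation
  println(defaultPageSize) // reuses the cached value; no recomputation
}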

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryPool/","title":"MemoryPool","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryPool is an abstraction of memory pools.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"memory/MemoryPool/#contract","title":"Contract","text":""},{"location":"memory/MemoryPool/#size-of-memory-used","title":"Size of Memory Used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                memoryUsed: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MemoryPool is requested for the amount of free memory and decrementPoolSize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"memory/MemoryPool/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutionMemoryPool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • StorageMemoryPool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"memory/MemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MemoryPool takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Lock Object Abstract Class

MemoryPool is an abstract class and cannot be created directly. It is created indirectly as one of the concrete MemoryPools.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"memory/MemoryPool/#free-memory","title":"Free Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  memoryFree\n

memoryFree is the amount of free memory in the pool (the pool size minus memoryUsed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  memoryFree\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutionMemoryPool is requested to acquireMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • StorageMemoryPool is requested to acquireMemory and freeSpaceToShrinkPool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • UnifiedMemoryManager is requested to acquire execution and storage memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"memory/MemoryPool/#decrementpoolsize","title":"decrementPoolSize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  decrementPoolSize(\n  delta: Long): Unit\n

decrementPoolSize shrinks the pool size by the given delta (the pool can never become smaller than the memory currently in use). A simplified sketch of the whole MemoryPool abstraction follows the usage list below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  decrementPoolSize\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • UnifiedMemoryManager is requested to acquireExecutionMemory and acquireStorageMemory
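
The following is a simplified sketch of the MemoryPool abstraction described above (the field names and method bodies are assumptions for illustration, not Spark's actual implementation):

// Sketch: a resizable memory region with a used-bytes counter, guarded by a lock object.
abstract class SketchMemoryPool(lock: Object) {
  private var _poolSize: Long = 0L

  def poolSize: Long = lock.synchronized(_poolSize)

  // Contract: concrete pools report how much of the pool is in use.
  def memoryUsed: Long

  // Free memory is whatever has not been used yet.
  def memoryFree: Long = lock.synchronized(_poolSize - memoryUsed)

  def incrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0)
    _poolSize += delta
  }

  // Shrinks the pool, but never below what is currently in use.
  def decrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0 && delta <= _poolSize && _poolSize - delta >= memoryUsed)
    _poolSize -= delta
  }
}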
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"memory/StorageMemoryPool/","title":"StorageMemoryPool","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StorageMemoryPool is a MemoryPool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"memory/StorageMemoryPool/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StorageMemoryPool takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Lock Object
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MemoryMode (ON_HEAP or OFF_HEAP)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StorageMemoryPool is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MemoryManager is created (and initializes on-heap and off-heap storage memory pools)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"memory/StorageMemoryPool/#memorystore","title":"MemoryStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StorageMemoryPool is given a MemoryStore when MemoryManager is requested to associate one with the on- and off-heap storage memory pools.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StorageMemoryPool uses the MemoryStore (to evict blocks) when requested to:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Acquire Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Free Space to Shrink Pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"memory/StorageMemoryPool/#size-of-memory-used","title":"Size of Memory Used

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StorageMemoryPool keeps track of the size of the memory acquired.

The size decreases when StorageMemoryPool is requested to releaseMemory or releaseAllMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    memoryUsed is part of the MemoryPool abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"memory/StorageMemoryPool/#acquiring-memory","title":"Acquiring Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    acquireMemory(\n  blockId: BlockId,\n  numBytes: Long): Boolean\nacquireMemory(\n  blockId: BlockId,\n  numBytesToAcquire: Long,\n  numBytesToFree: Long): Boolean\n

acquireMemory acquires numBytes of memory for the given BlockId, requesting the MemoryStore to evict blocks when free memory is not enough, and returns whether the acquisition succeeded (a simplified sketch follows the usage list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    acquireMemory\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnifiedMemoryManager is requested to acquire storage memory
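
The following is a hedged sketch of the evict-then-acquire idea behind acquireMemory (the parameters, e.g. evictBlocksToFreeSpace and recordUsed, are assumptions for illustration, not Spark's actual signatures):

object StorageAcquireSketch {
  // Sketch only: acquire numBytesToAcquire for a block, evicting other blocks when free memory is short.
  def acquireMemory(
      numBytesToAcquire: Long,
      memoryFree: () => Long,
      evictBlocksToFreeSpace: Long => Long, // returns how many bytes eviction actually freed
      recordUsed: Long => Unit): Boolean = {
    val numBytesToFree = math.max(0L, numBytesToAcquire - memoryFree())
    if (numBytesToFree > 0) evictBlocksToFreeSpace(numBytesToFree)
    val enoughMemory = numBytesToAcquire <= memoryFree()
    if (enoughMemory) recordUsed(numBytesToAcquire)
    enoughMemory
  }
}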
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"memory/StorageMemoryPool/#freeing-space-to-shrink-pool","title":"Freeing Space to Shrink Pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    freeSpaceToShrinkPool(\n  spaceToFree: Long): Long\n

freeSpaceToShrinkPool makes up to spaceToFree bytes available so the pool can be shrunk, counting memory that is already free first and requesting the MemoryStore to evict blocks only for the remainder, and returns the number of bytes actually freed (a simplified sketch follows the usage list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    freeSpaceToShrinkPool\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnifiedMemoryManager is requested to acquire execution memory
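
The following is a hedged sketch of the idea behind freeSpaceToShrinkPool (parameter names are assumptions for illustration): already-free memory is reclaimed first, and blocks are evicted only for the remainder.

object ShrinkPoolSketch {
  // Sketch only: make up to spaceToFree bytes available so the pool can be shrunk.
  def freeSpaceToShrinkPool(
      spaceToFree: Long,
      memoryFree: Long,
      evictBlocksToFreeSpace: Long => Long): Long = {
    val freedByUnusedMemory = math.min(spaceToFree, memoryFree)
    val remaining = spaceToFree - freedByUnusedMemory
    val freedByEviction = if (remaining > 0) evictBlocksToFreeSpace(remaining) else 0L
    freedByUnusedMemory + freedByEviction
  }
}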
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"memory/TaskMemoryManager/","title":"TaskMemoryManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMemoryManager manages the memory allocated to a single task (using MemoryManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskMemoryManager assumes that:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1. The number of bits to address pages is 13
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2. The number of bits to encode offsets in pages is 51 (64 bits - 13 bits)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    3. Number of pages in the page table and to be allocated is 8192 (1 << 13)
4. The maximum page size is about 17GB (((1L << 31) - 1) * 8L bytes)"},{"location":"memory/TaskMemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskMemoryManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Task Attempt ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskRunner is requested to run

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/TaskMemoryManager/#memorymanager","title":"MemoryManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager is given a MemoryManager when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager uses the MemoryManager\u00a0when requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Acquiring, releasing or cleaning up execution memory
• Reporting memory usage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • pageSizeBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Allocating a memory block for Tungsten consumers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • freePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • getMemoryConsumptionForThisTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#page-table-memoryblocks","title":"Page Table (MemoryBlocks)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager uses an array of MemoryBlocks (to mimic an operating system's page table).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The page table uses 13 bits for addressing pages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        A page is \"stored\" in allocatePage and \"removed\" in freePage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        All pages are released (removed) in cleanUpAllAllocatedMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager uses the page table when requested to:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • getPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • getOffsetInPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#spillable-memory-consumers","title":"Spillable Memory Consumers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        HashSet<MemoryConsumer> consumers\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager tracks spillable memory consumers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager registers a new memory consumer when requested to acquire execution memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager removes (clears) all registered memory consumers when cleaning up all the allocated memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Memory consumers are used to report memory usage when TaskMemoryManager is requested to show memory usage.
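
The following is a hedged sketch of tracking spillable consumers in a HashSet and summing their reported usage (the ConsumerSketch trait and the registry class are made up for illustration):

import scala.collection.mutable

// Sketch only: a made-up consumer interface to illustrate registration and usage reporting.
trait ConsumerSketch {
  def getUsed: Long            // bytes currently held by this consumer
  def spill(size: Long): Long  // asks the consumer to release up to size bytes
}

class ConsumerRegistrySketch {
  private val consumers = mutable.HashSet.empty[ConsumerSketch]

  // Register a consumer the first time it acquires execution memory.
  def register(c: ConsumerSketch): Unit = synchronized { consumers += c }

  // Rough equivalent of "show memory usage": sum what every registered consumer reports.
  def totalUsed: Long = synchronized { consumers.iterator.map(_.getUsed).sum }

  // Cleaning up all allocated memory: drop all registered consumers.
  def clear(): Unit = synchronized { consumers.clear() }
}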

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#memory-acquired-but-not-used","title":"Memory Acquired But Not Used

TaskMemoryManager tracks the size of memory allocated but not used (by any of the MemoryConsumers, due to an OutOfMemoryError when trying to use it).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager releases the memory when cleaning up all the allocated memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#allocated-pages","title":"Allocated Pages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BitSet allocatedPages\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskMemoryManager uses a BitSet (Java) to track allocated pages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The size is exactly the number of entries in the page table (8192).
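
The following is a hedged sketch of tracking allocated page numbers with a java.util.BitSet sized to the page table (the class and method names are assumptions for illustration):

import java.util.BitSet

class PageNumberAllocatorSketch(pageTableSize: Int = 1 << 13) {
  private val allocatedPages = new BitSet(pageTableSize)

  def allocatePageNumber(): Int = synchronized {
    val pageNumber = allocatedPages.nextClearBit(0) // first free slot in the page table
    require(pageNumber < pageTableSize, "all page-table entries are already allocated")
    allocatedPages.set(pageNumber)
    pageNumber
  }

  def freePageNumber(pageNumber: Int): Unit = synchronized {
    allocatedPages.clear(pageNumber)
  }
}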

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#memorymode","title":"MemoryMode

TaskMemoryManager works in either ON_HEAP or OFF_HEAP mode (in OFF_HEAP mode much of the page bookkeeping is short-circuited, and the extra branching is expected to be handled well by the JIT).

TaskMemoryManager adopts the MemoryMode of the given MemoryManager when created.

TaskMemoryManager uses the MemoryMode in the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • allocatePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • cleanUpAllAllocatedMemory

In OFF_HEAP mode, TaskMemoryManager has to adjust the offset in encodePageNumberAndOffset and getOffsetInPage (off-heap addresses are absolute, so the page's base offset is subtracted when encoding an address and added back when resolving it).

In OFF_HEAP mode, getPage returns no page (null), as off-heap data is addressed directly rather than through a page object.
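
The following sketch (not Spark code) shows the idea of packing a page number and an in-page offset into a single long address; the 13-bit page number / 51-bit offset split is an assumption chosen to match the 8192-entry page table above:

// Sketch (not Spark code) of packing a page number and an in-page offset into one long.
// For off-heap pages, the caller is expected to pass offsetInPage relative to the page's
// base address, mirroring the adjustment described above.
class PageAddressCodec {
    static final int PAGE_NUMBER_BITS = 13;                            // 2^13 = 8192 pages
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;              // 51
    static final long MASK_LOWER_OFFSET_BITS = (1L << OFFSET_BITS) - 1;

    static long encode(int pageNumber, long offsetInPage) {
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LOWER_OFFSET_BITS);
    }

    static int decodePageNumber(long pagePlusOffsetAddress) {
        return (int) (pagePlusOffsetAddress >>> OFFSET_BITS);
    }

    static long decodeOffset(long pagePlusOffsetAddress) {
        return pagePlusOffsetAddress & MASK_LOWER_OFFSET_BITS;
    }

    public static void main(String[] args) {
        long address = encode(42, 1_024L);
        System.out.println(decodePageNumber(address)); // 42
        System.out.println(decodeOffset(address));     // 1024
    }
}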

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The MemoryMode is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BytesToBytesMap is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • UnsafeExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spillable is requested to spill (only when in ON_HEAP mode)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#acquiring-execution-memory","title":"Acquiring Execution Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        long acquireExecutionMemory(\n  long required,\n  MemoryConsumer consumer)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        acquireExecutionMemory allocates up to required execution memory (bytes) for the MemoryConsumer (from the MemoryManager).

When not enough memory could be allocated initially, acquireExecutionMemory requests every consumer (with the same MemoryMode, including itself) to spill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        acquireExecutionMemory returns the amount of memory allocated.

acquireExecutionMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is requested to acquire execution memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskMemoryManager is requested to allocate a page

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        acquireExecutionMemory requests the MemoryManager to acquire execution memory (with required bytes, the taskAttemptId and the MemoryMode of the MemoryConsumer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, acquireExecutionMemory registers the MemoryConsumer (and adds it to the consumers registry) and prints out the following DEBUG message to the logs:

Task [taskAttemptId] acquired [got] for [consumer]

In case the MemoryManager has offered less memory than required, acquireExecutionMemory finds the MemoryConsumers (in the consumers registry) with the same MemoryMode and non-zero memory used, sorts them by memory usage, and requests them (one by one) to spill until enough memory is acquired or there are no more consumers to release memory from (by spilling).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When a MemoryConsumer releases memory, acquireExecutionMemory prints out the following DEBUG message to the logs:

Task [taskAttemptId] released [released] from [c] for [consumer]

If there is still not enough memory (less than required), acquireExecutionMemory requests the requesting MemoryConsumer itself to spill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        acquireExecutionMemory prints out the following DEBUG message to the logs:

Task [taskAttemptId] released [released] from itself ([consumer])
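
The acquire-then-spill sequence above can be condensed into the following sketch (simplified and not Spark code: Pool and Spillable are hypothetical stand-ins for MemoryManager and MemoryConsumer, and consumers are simply tried largest-first):

import java.util.*;

// Simplified sketch (not Spark code) of the acquire / spill-others / spill-self loop.
class AcquireSketch {
    interface Pool { long acquire(long bytes); }                 // may grant less than asked
    interface Spillable { long used(); long spill(long bytes); }

    static long acquireExecutionMemory(long required, Spillable requester,
                                       List<Spillable> consumers, Pool pool) {
        long got = pool.acquire(required);

        // 1. Ask other consumers (largest first here, as a simplification) to spill
        //    until enough memory has been granted or no consumer has anything left.
        if (got < required) {
            List<Spillable> others = new ArrayList<>(consumers);
            others.remove(requester);
            others.sort(Comparator.comparingLong(Spillable::used).reversed());
            for (Spillable c : others) {
                if (got >= required) break;
                if (c.used() == 0) continue;
                long released = c.spill(required - got);
                if (released > 0) {
                    got += pool.acquire(required - got);
                }
            }
        }

        // 2. Still short? Ask the requesting consumer itself to spill.
        if (got < required) {
            long released = requester.spill(required - got);
            if (released > 0) {
                got += pool.acquire(required - got);
            }
        }
        return got;  // the amount actually granted, possibly less than required
    }
}
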
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#releasing-execution-memory","title":"Releasing Execution Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        void releaseExecutionMemory(\n  long size,\n  MemoryConsumer consumer)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        releaseExecutionMemory prints out the following DEBUG message to the logs:

Task [taskAttemptId] release [size] from [consumer]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, releaseExecutionMemory requests the MemoryManager to releaseExecutionMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        releaseExecutionMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is requested to free up memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskMemoryManager is requested to allocatePage and freePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#pageSizeBytes","title":"Page Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        long pageSizeBytes()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        pageSizeBytes requests the MemoryManager for the page size.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        pageSizeBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is created (as a MemoryConsumer)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#reporting-memory-usage","title":"Reporting Memory Usage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        void showMemoryUsage()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        showMemoryUsage prints out the following INFO message to the logs (with the taskAttemptId):

Memory used in task [taskAttemptId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        showMemoryUsage requests every MemoryConsumer to report memory used. For consumers with non-zero memory usage, showMemoryUsage prints out the following INFO message to the logs:

Acquired by [consumer]: [memUsage]

showMemoryUsage requests the MemoryManager to getExecutionMemoryUsageForTask to calculate the memory not accounted for, i.e. memory used by the task that is not associated with a specific consumer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        showMemoryUsage prints out the following INFO messages to the logs:

[memoryNotAccountedFor] bytes of memory were used by task [taskAttemptId] but are not associated with specific consumers
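
As a tiny illustration (the Spillable interface below is a hypothetical stand-in for MemoryConsumer, not Spark's API), the "not accounted for" figure is simply the task's execution memory minus what the registered consumers report:

import java.util.List;

// Illustrative sketch (not Spark code): memory held by the task that no registered
// consumer accounts for, i.e. what the log line above reports.
class MemoryUsageReport {
    interface Spillable { long used(); }   // hypothetical stand-in for MemoryConsumer

    static long memoryNotAccountedFor(long taskExecutionMemory, List<Spillable> consumers) {
        long accountedFor = consumers.stream().mapToLong(Spillable::used).sum();
        return taskExecutionMemory - accountedFor;
    }
}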

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        showMemoryUsage requests the MemoryManager for the executionMemoryUsed and storageMemoryUsed and prints out the following INFO message to the logs:

[executionMemoryUsed] bytes of memory are used for execution and
[storageMemoryUsed] bytes of memory are used for storage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        showMemoryUsage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is requested to throw an OutOfMemoryError
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#cleaning-up-all-allocated-memory","title":"Cleaning Up All Allocated Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        long cleanUpAllAllocatedMemory()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanUpAllAllocatedMemory finds all the registered MemoryConsumers (in the consumers registry) that still keep some memory used and, for every such consumer, prints out the following DEBUG message to the logs:

unreleased [getUsed] memory from [consumer]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanUpAllAllocatedMemory removes all the consumers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        For every MemoryBlock in the pageTable, cleanUpAllAllocatedMemory prints out the following DEBUG message to the logs:

unreleased page: [page] in task [taskAttemptId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanUpAllAllocatedMemory marks the pages to be freed (FREED_IN_TMM_PAGE_NUMBER) and requests the MemoryManager for the tungstenMemoryAllocator to free up the MemoryBlock.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanUpAllAllocatedMemory clears the pageTable registry (by assigning null values).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        cleanUpAllAllocatedMemory requests the MemoryManager to release execution memory that is not used by any consumer (with the acquiredButNotUsed and the tungstenMemoryMode).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, cleanUpAllAllocatedMemory requests the MemoryManager to release all execution memory for the task.
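
The sequence above can be condensed into the following sketch (hypothetical Pool and Spillable types, not Spark code; logging is shown as plain println for brevity):

import java.util.*;

// Condensed sketch (not Spark code) of the clean-up sequence: log leaks, drop consumers,
// free any still-allocated pages, then hand the remaining accounting back to the pool.
class CleanUpSketch {
    interface Pool {
        void free(Object page);
        void releaseUnused(long bytes);
        long releaseAllForTask(long taskAttemptId);      // returns memory leaked by consumers
    }
    interface Spillable { long used(); }

    static long cleanUpAllAllocatedMemory(Set<Spillable> consumers, Object[] pageTable,
                                          long acquiredButNotUsed, long taskAttemptId, Pool pool) {
        for (Spillable c : consumers) {
            if (c.used() > 0) {
                System.out.println("unreleased " + c.used() + " memory from " + c);
            }
        }
        consumers.clear();

        for (int i = 0; i < pageTable.length; i++) {
            if (pageTable[i] != null) {
                System.out.println("unreleased page: " + pageTable[i] + " in task " + taskAttemptId);
                pool.free(pageTable[i]);
                pageTable[i] = null;
            }
        }

        pool.releaseUnused(acquiredButNotUsed);           // memory acquired but never backed by a page
        return pool.releaseAllForTask(taskAttemptId);     // anything consumers failed to release
    }
}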

cleanUpAllAllocatedMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TaskRunner is requested to run a task (and the task has finished successfully)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#allocating-memory-page","title":"Allocating Memory Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MemoryBlock allocatePage(\n  long size,\n  MemoryConsumer consumer)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        allocatePage allocates a block of memory (page) that is:

1. Not larger than MAXIMUM_PAGE_SIZE_BYTES
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. For MemoryConsumers with the same MemoryMode as the TaskMemoryManager

allocatePage acquires execution memory (acquireExecutionMemory) for the size and the MemoryConsumer, and returns immediately (with null) when 0 bytes or less could be acquired.

allocatePage reserves the first clear bit in the allocatedPages registry as the page number (unless the whole page table is already taken, in which case allocatePage throws an IllegalStateException).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        allocatePage requests the MemoryManager for the tungstenMemoryAllocator that is requested to allocate the acquired memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        allocatePage registers the page in the pageTable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, allocatePage prints out the following TRACE message to the logs and returns the MemoryBlock allocated.

Allocate page number [pageNumber] ([acquired] bytes)
Usage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        allocatePage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is requested to allocate an array and a page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#toolargepageexception","title":"TooLargePageException

For sizes larger than MAXIMUM_PAGE_SIZE_BYTES, allocatePage throws a TooLargePageException.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#outofmemoryerror","title":"OutOfMemoryError

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Requesting the tungstenMemoryAllocator to allocate the acquired memory may throw an OutOfMemoryError. If so, allocatePage prints out the following WARN message to the logs:

Failed to allocate a page ([acquired] bytes), try again.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        allocatePage adds the acquired memory to the acquiredButNotUsed and removes the page from the allocatedPages (by clearing the bit).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, allocatePage tries to allocate the page again (recursively).
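
Putting the allocation steps and the OutOfMemoryError retry together, a condensed sketch (hypothetical Pool and Allocator types, not Spark code; the TooLargePageException size check and the MemoryMode assertion described above are omitted):

import java.util.BitSet;

// Condensed sketch (not Spark code) of the allocatePage flow: acquire memory, reserve a
// page number, allocate the block, register it, and retry once if the allocator throws OOM.
class AllocatePageSketch {
    interface Pool { long acquire(long bytes); void release(long bytes); }
    interface Allocator { Object allocate(long bytes); }   // may throw OutOfMemoryError

    static final int PAGE_TABLE_SIZE = 8192;
    final Object[] pageTable = new Object[PAGE_TABLE_SIZE];
    final BitSet allocatedPages = new BitSet(PAGE_TABLE_SIZE);
    long acquiredButNotUsed = 0L;

    Object allocatePage(long size, Pool pool, Allocator allocator) {
        long acquired = pool.acquire(size);
        if (acquired <= 0) {
            return null;                                    // could not get any execution memory
        }
        int pageNumber;
        synchronized (this) {
            pageNumber = allocatedPages.nextClearBit(0);
            if (pageNumber >= PAGE_TABLE_SIZE) {
                pool.release(acquired);
                throw new IllegalStateException("Have already allocated a maximum of " + PAGE_TABLE_SIZE + " pages");
            }
            allocatedPages.set(pageNumber);
        }
        Object page;
        try {
            page = allocator.allocate(acquired);
        } catch (OutOfMemoryError e) {
            // Remember the memory that was acquired but never backed by a page, give the
            // page number back, and try again (the accounting now reflects the shortfall).
            synchronized (this) {
                acquiredButNotUsed += acquired;
                allocatedPages.clear(pageNumber);
            }
            return allocatePage(size, pool, allocator);
        }
        pageTable[pageNumber] = page;                       // register the page
        return page;
    }
}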

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#releasing-memory-page","title":"Releasing Memory Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        void freePage(\n  MemoryBlock page,\n  MemoryConsumer consumer)\n

freePage removes the given MemoryBlock page from the page table and the allocatedPages registry (by clearing the page's bit), requests the MemoryManager's tungstenMemoryAllocator to free the page, and releases the page's size of execution memory for the given MemoryConsumer.

freePage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MemoryConsumer is requested to freePage and throwOom
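
A minimal Scala sketch of the release path (a paraphrase of the Java logic, using the page table and registries described on this page):

def freePage(page: MemoryBlock, consumer: MemoryConsumer): Unit = {
  val pageSize = page.size()
  pageTable(page.pageNumber) = null           // forget the page
  allocatedPages.clear(page.pageNumber)       // make its page number available again
  tungstenMemoryAllocator.free(page)          // return the memory to the allocator
  releaseExecutionMemory(pageSize, consumer)  // and to the execution memory pool
}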
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#getting-page","title":"Getting Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Object getPage(\n  long pagePlusOffsetAddress)\n

getPage handles the ON_HEAP tungstenMemoryMode only.

getPage looks up the page in the page table (by the page number decoded from the given address) and returns the page's base object.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Location (of BytesToBytesMap) is requested to updateAddressesAndSizes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortComparator (of UnsafeInMemorySorter) is requested to compare two record pointers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortedIterator (of UnsafeInMemorySorter) is requested to loadNext record
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#getoffsetinpage","title":"getOffsetInPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        long getOffsetInPage(\n  long pagePlusOffsetAddress)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOffsetInPage gives the offset associated with the given pagePlusOffsetAddress (encoded by encodePageNumberAndOffset).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOffsetInPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Location (of BytesToBytesMap) is requested to updateAddressesAndSizes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortComparator (of UnsafeInMemorySorter) is requested to compare two record pointers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortedIterator (of UnsafeInMemorySorter) is requested to loadNext record
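
Both getPage and getOffsetInPage decode a 64-bit pagePlusOffsetAddress produced by encodePageNumberAndOffset: the upper 13 bits carry the page number (the index into the page table) and the lower 51 bits carry the offset. A minimal Scala sketch of that layout (illustrative helpers, not the exact Spark sources):

val PageNumberBits  = 13
val OffsetBits      = 64 - PageNumberBits        // 51
val Lower51BitsMask = (1L << OffsetBits) - 1

def encodePageNumberAndOffset(pageNumber: Int, offsetInPage: Long): Long =
  (pageNumber.toLong << OffsetBits) | (offsetInPage & Lower51BitsMask)

def decodePageNumber(address: Long): Int = (address >>> OffsetBits).toInt  // used by getPage
def decodeOffset(address: Long): Long    = address & Lower51BitsMask       // used by getOffsetInPage

For OFF_HEAP pages the decoded offset is relative to the page's base offset, which getOffsetInPage adds back to produce an absolute address.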
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/TaskMemoryManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.memory.TaskMemoryManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.memory.TaskMemoryManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"memory/UnifiedMemoryManager/","title":"UnifiedMemoryManager","text":"

UnifiedMemoryManager is a MemoryManager (with the onHeapExecutionMemory being the Maximum Heap Memory minus the onHeapStorageRegionSize).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        UnifiedMemoryManager allows for soft boundaries between storage and execution memory (allowing requests for memory in one region to be fulfilled by borrowing memory from the other).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"memory/UnifiedMemoryManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        UnifiedMemoryManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Maximum Heap Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Size of the On-Heap Storage Region
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Number of CPU Cores

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          While being created, UnifiedMemoryManager asserts the invariants.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnifiedMemoryManager is created\u00a0using apply factory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"memory/UnifiedMemoryManager/#invariants","title":"Invariants

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnifiedMemoryManager asserts the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Sum of the pool size of the on-heap ExecutionMemoryPool and on-heap StorageMemoryPool is exactly the maximum heap memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Sum of the pool size of the off-heap ExecutionMemoryPool and off-heap StorageMemoryPool is exactly the maximum off-heap memory
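
Expressed as a short Scala sketch (with the pool and memory names used throughout this page):

assert(onHeapExecutionMemoryPool.poolSize + onHeapStorageMemoryPool.poolSize == maxHeapMemory)
assert(offHeapExecutionMemoryPool.poolSize + offHeapStorageMemoryPool.poolSize == maxOffHeapMemory)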

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#total-available-on-heap-memory-for-storage","title":"Total Available On-Heap Memory for Storage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maxOnHeapStorageMemory: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maxOnHeapStorageMemory\u00a0is part of the MemoryManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maxOnHeapStorageMemory is the difference between Maximum Heap Memory and the memory used in the on-heap execution memory pool.
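
In other words (a one-line sketch using the names above):

def maxOnHeapStorageMemory: Long = maxHeapMemory - onHeapExecutionMemoryPool.memoryUsed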

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#size-of-the-on-heap-storage-memory","title":"Size of the On-Heap Storage Memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnifiedMemoryManager is given the size of the on-heap storage memory (region) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The size is the fraction (based on spark.memory.storageFraction configuration property) of the maximum heap memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The remaining memory space (of the maximum heap memory) is used for the on-heap execution memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#creating-unifiedmemorymanager","title":"Creating UnifiedMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          apply(\n  conf: SparkConf,\n  numCores: Int): UnifiedMemoryManager\n

apply creates a UnifiedMemoryManager with the Maximum Heap Memory and with the size of the on-heap storage region being spark.memory.storageFraction of that maximum memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          apply\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkEnv utility is used to create a base SparkEnv (for the driver and executors)
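
A simplified Scala sketch of the factory (the constructor parameter names are illustrative; getMaxMemory is described in the Maximum Heap Memory section below):

def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
  val maxMemory = getMaxMemory(conf)  // spark.memory.fraction of the "usable" memory
  new UnifiedMemoryManager(
    conf,
    maxHeapMemory = maxMemory,
    onHeapStorageRegionSize =
      (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
    numCores = numCores)
}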
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#maximum-heap-memory","title":"Maximum Heap Memory

UnifiedMemoryManager is given the maximum heap memory to use (for execution and storage) when created (through the apply factory method, which computes it with getMaxMemory).

UnifiedMemoryManager makes sure that the driver's system memory is at least 1.5 times the Reserved System Memory. Otherwise, getMaxMemory throws an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          System memory [systemMemory] must be at least [minSystemMemory].\nPlease increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.\n

UnifiedMemoryManager makes sure that the executor memory (spark.executor.memory) is at least 1.5 times the Reserved System Memory (the same minimum as for the driver). Otherwise, getMaxMemory throws an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Executor memory [executorMemory] must be at least [minSystemMemory].\nPlease increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnifiedMemoryManager considers \"usable\" memory to be the system memory without the reserved memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnifiedMemoryManager uses the fraction (based on spark.memory.fraction configuration property) of the \"usable\" memory for the maximum heap memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#demo","title":"Demo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // local mode with --conf spark.driver.memory=2g\nscala> sc.getConf.getSizeAsBytes(\"spark.driver.memory\")\nres0: Long = 2147483648\n\nscala> val systemMemory = Runtime.getRuntime.maxMemory\n\n// fixed amount of memory for non-storage, non-execution purposes\n// UnifiedMemoryManager.RESERVED_SYSTEM_MEMORY_BYTES\nval reservedMemory = 300 * 1024 * 1024\n\n// minimum system memory required\nval minSystemMemory = (reservedMemory * 1.5).ceil.toLong\n\nval usableMemory = systemMemory - reservedMemory\n\nval memoryFraction = sc.getConf.getDouble(\"spark.memory.fraction\", 0.6)\nscala> val maxMemory = (usableMemory * memoryFraction).toLong\nmaxMemory: Long = 956615884\n\nimport org.apache.spark.network.util.JavaUtils\nscala> JavaUtils.byteStringAsMb(maxMemory + \"b\")\nres1: Long = 912\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#reserved-system-memory","title":"Reserved System Memory

UnifiedMemoryManager considers 300MB (300 * 1024 * 1024 bytes) to be reserved system memory while calculating the maximum heap memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#acquiring-execution-memory-for-task","title":"Acquiring Execution Memory for Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireExecutionMemory(\n  numBytes: Long,\n  taskAttemptId: Long,\n  memoryMode: MemoryMode): Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireExecutionMemory asserts the invariants.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireExecutionMemory selects the execution and storage pools, the storage region size and the maximum memory for the given MemoryMode.

| MemoryMode | ON_HEAP | OFF_HEAP |
| --- | --- | --- |
| executionPool | onHeapExecutionMemoryPool | offHeapExecutionMemoryPool |
| storagePool | onHeapStorageMemoryPool | offHeapStorageMemoryPool |
| storageRegionSize | onHeapStorageRegionSize | offHeapStorageMemory |
| maxMemory | maxHeapMemory | maxOffHeapMemory |

In the end, acquireExecutionMemory requests the selected ExecutionMemoryPool to acquire numBytes bytes (passing in the maybeGrowExecutionPool and the maximum size of the execution pool functions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          acquireExecutionMemory\u00a0is part of the MemoryManager abstraction.
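
The selection can be sketched as a Scala pattern match over the MemoryMode (names as in the table above; the acquireMemory call at the end is simplified):

val (executionPool, storagePool, storageRegionSize, maxMemory) = memoryMode match {
  case MemoryMode.ON_HEAP =>
    (onHeapExecutionMemoryPool, onHeapStorageMemoryPool, onHeapStorageRegionSize, maxHeapMemory)
  case MemoryMode.OFF_HEAP =>
    (offHeapExecutionMemoryPool, offHeapStorageMemoryPool, offHeapStorageMemory, maxOffHeapMemory)
}
executionPool.acquireMemory(
  numBytes, taskAttemptId,
  maybeGrowExecutionPool,              // may shrink the storage pool to grow execution
  () => computeMaxExecutionPoolSize()) // upper bound for this task's share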

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#maybegrowexecutionpool","title":"maybeGrowExecutionPool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maybeGrowExecutionPool(\n  extraMemoryNeeded: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maybeGrowExecutionPool...FIXME
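
The borrowing idea can be sketched as follows (an illustration of the soft-boundary design described on this page, not the exact sources; names follow the table above):

def maybeGrowExecutionPool(extraMemoryNeeded: Long): Unit = {
  if (extraMemoryNeeded > 0) {
    // Execution may reclaim storage memory that is currently free or that storage
    // borrowed beyond its dedicated region.
    val reclaimable = math.max(storagePool.memoryFree, storagePool.poolSize - storageRegionSize)
    if (reclaimable > 0) {
      val reclaimed = storagePool.freeSpaceToShrinkPool(math.min(reclaimable, extraMemoryNeeded))
      storagePool.decrementPoolSize(reclaimed)
      executionPool.incrementPoolSize(reclaimed)
    }
  }
}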

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnifiedMemoryManager/#maximum-size-of-execution-pool","title":"Maximum Size of Execution Pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          computeMaxExecutionPoolSize(): Long\n

computeMaxExecutionPoolSize takes the minimum of the following two values (for the ON_HEAP or OFF_HEAP memory mode, respectively):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Memory used of the on-heap or the off-heap storage memory pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • On-heap or the off-heap storage memory size

In the end, computeMaxExecutionPoolSize returns the maximum memory (maxHeapMemory or maxOffHeapMemory for the ON_HEAP or OFF_HEAP memory mode, respectively) minus that minimum storage memory size.
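
As a one-line Scala sketch (names as in the table above):

def computeMaxExecutionPoolSize(): Long =
  maxMemory - math.min(storagePool.memoryUsed, storageRegionSize)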

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"memory/UnsafeExternalSorter/","title":"UnsafeExternalSorter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnsafeExternalSorter is a MemoryConsumer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"memory/UnsafeExternalSorter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UnsafeExternalSorter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • RecordComparator Supplier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • PrefixComparator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Initial Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Page size (in bytes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • numElementsForSpillThreshold
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeInMemorySorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • canUseRadixSort flag

UnsafeExternalSorter is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeExternalSorter utility is used to createWithExistingInMemorySorter and create
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"memory/UnsafeExternalSorter/#createwithexistinginmemorysorter","title":"createWithExistingInMemorySorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            UnsafeExternalSorter createWithExistingInMemorySorter(\n  TaskMemoryManager taskMemoryManager,\n  BlockManager blockManager,\n  SerializerManager serializerManager,\n  TaskContext taskContext,\n  Supplier<RecordComparator> recordComparatorSupplier,\n  PrefixComparator prefixComparator,\n  int initialSize,\n  long pageSizeBytes,\n  int numElementsForSpillThreshold,\n  UnsafeInMemorySorter inMemorySorter,\n  long existingMemoryConsumption)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createWithExistingInMemorySorter...FIXME

createWithExistingInMemorySorter is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeKVExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"memory/UnsafeExternalSorter/#create","title":"create
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            UnsafeExternalSorter create(\n  TaskMemoryManager taskMemoryManager,\n  BlockManager blockManager,\n  SerializerManager serializerManager,\n  TaskContext taskContext,\n  Supplier<RecordComparator> recordComparatorSupplier,\n  PrefixComparator prefixComparator,\n  int initialSize,\n  long pageSizeBytes,\n  int numElementsForSpillThreshold,\n  boolean canUseRadixSort)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            create creates a new UnsafeExternalSorter with no UnsafeInMemorySorter given (null).

create is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeExternalRowSorter and UnsafeKVExternalSorter are created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"memory/UnsafeInMemorySorter/","title":"UnsafeInMemorySorter","text":""},{"location":"memory/UnsafeInMemorySorter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            UnsafeInMemorySorter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryConsumer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RecordComparator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PrefixComparator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Long Array or Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • canUseRadixSort flag

UnsafeInMemorySorter is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • UnsafeExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • UnsafeKVExternalSorter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"memory/UnsafeSorterSpillReader/","title":"UnsafeSorterSpillReader","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              UnsafeSorterSpillReader is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"memory/UnsafeSorterSpillWriter/","title":"UnsafeSorterSpillWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              UnsafeSorterSpillWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/","title":"Spark Metrics","text":"

Spark Metrics gives you execution metrics of Spark subsystems (metrics instances), e.g. the driver of a Spark application or the master of a Spark Standalone cluster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark Metrics uses Dropwizard Metrics Java library for the metrics infrastructure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Metrics is a Java library which gives you unparalleled insight into what your code does in production.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Metrics provides a powerful toolkit of ways to measure the behavior of critical components in your production environment.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#metrics-systems","title":"Metrics Systems","text":""},{"location":"metrics/#applicationmaster","title":"applicationMaster","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when ApplicationMaster (Hadoop YARN) is requested to createAllocator

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#applications","title":"applications","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when Master (Spark Standalone) is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#driver","title":"driver","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when SparkEnv is created for the driver

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#executor","title":"executor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when SparkEnv is created for an executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#master","title":"master","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when Master (Spark Standalone) is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#mesos_cluster","title":"mesos_cluster","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when MesosClusterScheduler (Apache Mesos) is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#shuffleservice","title":"shuffleService","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when ExternalShuffleService is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#worker","title":"worker","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered when Worker (Spark Standalone) is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/#metricssystem","title":"MetricsSystem

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark Metrics uses MetricsSystem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsSystem uses Dropwizard Metrics' MetricRegistry that acts as the integration point between Spark and the metrics library.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              A Spark subsystem can access the MetricsSystem through the SparkEnv.metricsSystem property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              val metricsSystem = SparkEnv.get.metricsSystem\n
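Spark's own subsystems register their metrics sources with the MetricsSystem. A minimal sketch of that pattern follows; mySource stands for a hypothetical custom org.apache.spark.metrics.source.Source implementation, and the call is shown as an illustration only (MetricsSystem is an internal API, so availability outside Spark code may vary):

SparkEnv.get.metricsSystem.registerSource(mySource) // mySource: a hypothetical custom Source\n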
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/#metricsconfig","title":"MetricsConfig

MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

metrics.properties is the default metrics configuration file. It is configured using the spark.metrics.conf configuration property. The file is first loaded directly from the given path and, if not available, from Spark's CLASSPATH.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsConfig also accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.
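For example, the JmxSink sink described later on this page could be enabled with a spark.metrics.conf.-prefixed Spark property instead of a metrics.properties entry (an illustrative sketch, e.g. in spark-defaults.conf):

# illustration only: spark.metrics.conf. prefix + the metrics property key\nspark.metrics.conf.*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink\n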

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark comes with conf/metrics.properties.template file that is a template of metrics configuration.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/#metricsservlet-metrics-sink","title":"MetricsServlet Metrics Sink

Among the metrics sinks is MetricsServlet, which is used when the sink.servlet metrics sink is configured in the metrics configuration.
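A sketch of what such a sink.servlet configuration typically looks like (based on the defaults MetricsConfig is assumed to set up; the exact keys and values may differ across Spark versions):

# assumed defaults; verify against your Spark version\n*.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet\n*.sink.servlet.path=/metrics/json\n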

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CAUTION: FIXME Describe configuration files and properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/#jmxsink-metrics-sink","title":"JmxSink Metrics Sink

Enable org.apache.spark.metrics.sink.JmxSink in the metrics configuration.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              You can then use jconsole to access Spark metrics through JMX.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/#json-uri-path","title":"JSON URI Path

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Metrics System is available at http://localhost:4040/metrics/json (for the default setup of a Spark application).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ http --follow http://localhost:4040/metrics/json\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 2200\nContent-Type: text/json;charset=utf-8\nDate: Sat, 25 Feb 2017 14:14:16 GMT\nServer: Jetty(9.2.z-SNAPSHOT)\nX-Frame-Options: SAMEORIGIN\n\n{\n    \"counters\": {\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.fileCacheHits\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.filesDiscovered\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.hiveClientCalls\": {\n            \"count\": 2\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.parallelListingJobCount\": {\n            \"count\": 0\n        },\n        \"app-20170225151406-0000.driver.HiveExternalCatalog.partitionsFetched\": {\n            \"count\": 0\n        }\n    },\n    \"gauges\": {\n    ...\n    \"timers\": {\n        \"app-20170225151406-0000.driver.DAGScheduler.messageProcessingTime\": {\n            \"count\": 0,\n            \"duration_units\": \"milliseconds\",\n            \"m15_rate\": 0.0,\n            \"m1_rate\": 0.0,\n            \"m5_rate\": 0.0,\n            \"max\": 0.0,\n            \"mean\": 0.0,\n            \"mean_rate\": 0.0,\n            \"min\": 0.0,\n            \"p50\": 0.0,\n            \"p75\": 0.0,\n            \"p95\": 0.0,\n            \"p98\": 0.0,\n            \"p99\": 0.0,\n            \"p999\": 0.0,\n            \"rate_units\": \"calls/second\",\n            \"stddev\": 0.0\n        }\n    },\n    \"version\": \"3.0.0\"\n}\n

NOTE: You can access a Spark subsystem's MetricsSystem using its web UI port, e.g. 4040 for the driver, 8080 for Spark Standalone's master and applications.

NOTE: You have to use the trailing slash (/) to get the output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/#spark-standalone-master","title":"Spark Standalone Master
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ http http://192.168.1.4:8080/metrics/master/json/path\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 207\nContent-Type: text/json;charset=UTF-8\nServer: Jetty(8.y.z-SNAPSHOT)\nX-Frame-Options: SAMEORIGIN\n\n{\n    \"counters\": {},\n    \"gauges\": {\n        \"master.aliveWorkers\": {\n            \"value\": 0\n        },\n        \"master.apps\": {\n            \"value\": 0\n        },\n        \"master.waitingApps\": {\n            \"value\": 0\n        },\n        \"master.workers\": {\n            \"value\": 0\n        }\n    },\n    \"histograms\": {},\n    \"meters\": {},\n    \"timers\": {},\n    \"version\": \"3.0.0\"\n}\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"metrics/JvmSource/","title":"JvmSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              JvmSource is a metrics source.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The name of the source is jvm.

JvmSource registers the built-in Codahale metric sets:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • GarbageCollectorMetricSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MemoryUsageGaugeSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BufferPoolMetricSet

Among the metrics is total.committed (from MemoryUsageGaugeSet), which describes the current usage of the heap and non-heap memory.
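JvmSource can be enabled for all metrics instances in the metrics configuration; a sketch based on the commented-out entry in conf/metrics.properties.template (assumed here, so verify against your Spark version):

# enable the jvm source for all instances (master, worker, driver, executor, ...)\n*.source.jvm.class=org.apache.spark.metrics.source.JvmSource\n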

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/","title":"MetricsConfig","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).

MetricsConfig is created when MetricsSystem is.

MetricsConfig uses metrics.properties as the default metrics configuration file, which can be changed using the spark.metrics.conf configuration property. The file is loaded from the given path directly first and, if not found there, from Spark's CLASSPATH.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsConfig accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.

Spark comes with the conf/metrics.properties.template file, a template of the metrics configuration.

MetricsConfig makes sure that the default properties are always defined.

MetricsConfig's Default Metrics Properties:

| Name | Default Value |
|------|---------------|
| *.sink.servlet.class | org.apache.spark.metrics.sink.MetricsServlet |
| *.sink.servlet.path | /metrics/json |
| master.sink.servlet.path | /metrics/master/json |
| applications.sink.servlet.path | /metrics/applications/json |

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The order of precedence of metrics configuration settings is as follows:

1. Default properties
2. spark.metrics.conf configuration property or the metrics.properties configuration file
3. spark.metrics.conf.-prefixed Spark properties

(Later entries override earlier ones.)

MetricsConfig takes a SparkConf when created.

MetricsConfig's Internal Registries and Counters:

| Name | Description |
|------|-------------|
| properties | [java.util.Properties](https://docs.oracle.com/javase/8/docs/api/java/util/Properties.html) with metrics properties. Used to initialize the per-subsystem perInstanceSubProperties. |
| perInstanceSubProperties | Lookup table of metrics properties per subsystem |

=== Initializing MetricsConfig -- initialize Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#initialize-unit","title":"initialize(): Unit","text":"

initialize sets the default properties and loads the properties from the metrics configuration file (that is defined using the spark.metrics.conf configuration property).

initialize then takes all Spark properties with the spark.metrics.conf. prefix from the SparkConf and adds them to the properties registry (with the prefix removed).

In the end, initialize splits the properties per subsystem, with the default configuration (denoted as *) applied to all subsystems afterwards.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: initialize accepts * (star) for the default configuration or any combination of lower- and upper-case letters for Spark subsystem names.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: initialize is used exclusively when MetricsSystem is created.
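The prefix-stripping step can be pictured with a small standalone sketch (hypothetical code, not Spark's source) that copies spark.metrics.conf.-prefixed entries from a SparkConf into a Properties object without the prefix:

```scala
import java.util.Properties

import org.apache.spark.SparkConf

// Sketch of the spark.metrics.conf.-prefixed step of initialize: every matching
// SparkConf entry is copied into the metrics properties with the prefix removed.
val conf = new SparkConf(loadDefaults = false)
  .set("spark.metrics.conf.driver.sink.servlet.path", "/custom/json")

val properties = new Properties()
conf.getAll
  .filter { case (key, _) => key.startsWith("spark.metrics.conf.") }
  .foreach { case (key, value) =>
    properties.setProperty(key.stripPrefix("spark.metrics.conf."), value)
  }

properties.getProperty("driver.sink.servlet.path")
// "/custom/json"
```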

=== setDefaultProperties Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala_1","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#setdefaultpropertiesprop-properties-unit","title":"setDefaultProperties(prop: Properties): Unit","text":"

setDefaultProperties sets the default properties (in the input prop).

NOTE: setDefaultProperties is used exclusively when MetricsConfig is initialized.
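Based on the table of default metrics properties above, the effect of setDefaultProperties boils down to something like the following sketch (not Spark's source code):

```scala
import java.util.Properties

// What setDefaultProperties effectively guarantees, per the table of default
// metrics properties above.
def setDefaultProperties(prop: Properties): Unit = {
  prop.setProperty("*.sink.servlet.class", "org.apache.spark.metrics.sink.MetricsServlet")
  prop.setProperty("*.sink.servlet.path", "/metrics/json")
  prop.setProperty("master.sink.servlet.path", "/metrics/master/json")
  prop.setProperty("applications.sink.servlet.path", "/metrics/applications/json")
}
```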

=== Loading Custom Metrics Configuration File or metrics.properties -- loadPropertiesFromFile Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala_2","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#loadpropertiesfromfilepath-optionstring-unit","title":"loadPropertiesFromFile(path: Option[String]): Unit","text":"

loadPropertiesFromFile tries to open the file at the input path (if defined) or the default metrics configuration file metrics.properties (on the CLASSPATH).

If either file is available, loadPropertiesFromFile loads the properties (to the properties registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In case of exceptions, you should see the following ERROR message in the logs followed by the exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ERROR Error loading configuration file [file]\n

NOTE: loadPropertiesFromFile is used exclusively when MetricsConfig is initialized.
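A simplified version of that loading logic could look as follows (a sketch with simplified error handling; the exact classloader lookup in Spark may differ):

```scala
import java.io.{FileInputStream, InputStream}
import java.util.Properties

// Sketch of the loading logic: try the explicit path first, otherwise fall back
// to metrics.properties on the CLASSPATH.
def loadPropertiesFromFile(path: Option[String], properties: Properties): Unit = {
  try {
    val is: Option[InputStream] = path match {
      case Some(file) => Some(new FileInputStream(file))
      case None       => Option(getClass.getClassLoader.getResourceAsStream("metrics.properties"))
    }
    is.foreach { in =>
      try properties.load(in) finally in.close()
    }
  } catch {
    case e: Exception =>
      // Corresponds to the ERROR message mentioned above.
      println(s"Error loading configuration file ${path.getOrElse("metrics.properties")}: $e")
  }
}
```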

=== Grouping Properties Per Subsystem -- subProperties Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala_3","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#subpropertiesprop-properties-regex-regex-mutablehashmapstring-properties","title":"subProperties(prop: Properties, regex: Regex): mutable.HashMap[String, Properties]","text":"

subProperties takes the prop properties and destructures their keys using regex. For every key that matches regex, subProperties uses the matching prefix as a new key, with the matching suffix(es) as the value(s).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala_4","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#driverhelloworld-driver-helloworld","title":"driver.hello.world => (driver, (hello.world))","text":"

NOTE: subProperties is used when MetricsConfig is initialized (to apply the default metrics configuration) and when MetricsSystem registers metrics sources and sinks.
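The grouping can be reproduced with a few lines of standalone Scala (a sketch of the idea, not Spark's code; the regex is an assumption modelled on the note about initialize above):

```scala
import java.util.Properties
import scala.collection.mutable
import scala.jdk.CollectionConverters._
import scala.util.matching.Regex

// Group flat keys such as "driver.hello.world" into per-subsystem Properties,
// i.e. driver -> (hello.world -> value).
def subProperties(prop: Properties, regex: Regex): mutable.HashMap[String, Properties] = {
  val grouped = new mutable.HashMap[String, Properties]
  prop.asScala.foreach { case (key, value) =>
    key match {
      case regex(prefix, suffix) =>
        grouped.getOrElseUpdate(prefix, new Properties).setProperty(suffix, value)
      case _ => // key does not match the expected <instance>.<property> shape
    }
  }
  grouped
}

val prop = new Properties()
prop.setProperty("driver.hello.world", "true")
prop.setProperty("*.sink.servlet.path", "/metrics/json")

subProperties(prop, "^(\\*|[a-zA-Z]+)\\.(.+)".r)
// HashMap("*" -> {sink.servlet.path=/metrics/json}, "driver" -> {hello.world=true})
```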

=== getInstance Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsConfig/#source-scala_5","title":"[source, scala]","text":""},{"location":"metrics/MetricsConfig/#getinstanceinst-string-properties","title":"getInstance(inst: String): Properties","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getInstance...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              NOTE: getInstance is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsServlet/","title":"MetricsServlet JSON Metrics Sink","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsServlet is a metrics sink that gives metrics snapshots in JSON format.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsServlet is a \"special\" sink as it is only available to the metrics instances with a web UI:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Driver of a Spark application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Spark Standalone's Master and Worker

You can access the metrics from MetricsServlet at the /metrics/json URI by default. The entire URL depends on the metrics instance, e.g. http://localhost:4040/metrics/json/ for a running Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ http http://localhost:4040/metrics/json/\nHTTP/1.1 200 OK\nCache-Control: no-cache, no-store, must-revalidate\nContent-Length: 5005\nContent-Type: text/json;charset=utf-8\nDate: Mon, 11 Jun 2018 06:29:03 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nX-Content-Type-Options: nosniff\nX-Frame-Options: SAMEORIGIN\nX-XSS-Protection: 1; mode=block\n\n{\n    \"counters\": {\n        \"local-1528698499919.driver.HiveExternalCatalog.fileCacheHits\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.filesDiscovered\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.hiveClientCalls\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.parallelListingJobCount\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.HiveExternalCatalog.partitionsFetched\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.LiveListenerBus.numEventsPosted\": {\n            \"count\": 7\n        },\n        \"local-1528698499919.driver.LiveListenerBus.queue.appStatus.numDroppedEvents\": {\n            \"count\": 0\n        },\n        \"local-1528698499919.driver.LiveListenerBus.queue.executorManagement.numDroppedEvents\": {\n            \"count\": 0\n        }\n    },\n    ...\n

MetricsServlet is created exclusively when MetricsSystem is started (and requested to register metrics sinks).

MetricsServlet can be configured using configuration properties with the sink.servlet prefix (in the metrics configuration). That is not required, since MetricsConfig makes sure (in setDefaultProperties) that MetricsServlet is always configured.
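For example, the servlet path and sample flag of the driver instance could be overridden with spark.metrics.conf.-prefixed properties (a hypothetical override; the defaults from MetricsConfig usually need no tuning):

```scala
import org.apache.spark.SparkConf

// Hypothetical override of MetricsServlet's "path" and "sample" properties for the
// driver instance, using spark.metrics.conf.-prefixed properties instead of a
// metrics.properties file.
val conf = new SparkConf()
  .set("spark.metrics.conf.driver.sink.servlet.path", "/metrics/driver/json")
  .set("spark.metrics.conf.driver.sink.servlet.sample", "true")
```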

MetricsServlet uses [jackson-databind](https://fasterxml.github.io/jackson-databind/), the general data-binding package for Jackson (as the mapper), with the Dropwizard Metrics library (i.e. registering a Coda Hale MetricsModule).

MetricsServlet's Configuration Properties:

| Name | Default | Description |
|------|---------|-------------|
| path | /metrics/json/ | Path URI prefix to bind to |
| sample | false | Whether to show the entire set of samples for histograms |

MetricsServlet's Internal Properties (e.g. Registries, Counters and Flags):

| Name | Description |
|------|-------------|
| mapper | Jackson's [com.fasterxml.jackson.databind.ObjectMapper](https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html) that \"provides functionality for reading and writing JSON, either to and from basic POJOs (Plain Old Java Objects), or to and from a general-purpose JSON Tree Model (JsonNode), as well as related functionality for performing conversions.\" When created, mapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule. Used exclusively when MetricsServlet is requested for a metrics snapshot. |
| servletPath | Value of the path configuration property |
| servletShowSample | Flag to control whether to show samples (true) or not (false). servletShowSample is the value of the sample configuration property (if defined) or false. Used when mapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule. |

"},{"location":"metrics/MetricsServlet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsServlet takes the following when created:

• Configuration Properties (as Java Properties)
• MetricRegistry (Dropwizard Metrics)
• SecurityManager

MetricsServlet initializes the internal registries and counters.

=== Requesting Metrics Snapshot -- getMetricsSnapshot Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsServlet/#source-scala","title":"[source, scala]","text":""},{"location":"metrics/MetricsServlet/#getmetricssnapshotrequest-httpservletrequest-string","title":"getMetricsSnapshot(request: HttpServletRequest): String","text":"

getMetricsSnapshot simply requests the mapper to serialize the MetricRegistry to a JSON string (using [ObjectMapper.writeValueAsString](https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html#writeValueAsString-java.lang.Object-)).

NOTE: getMetricsSnapshot is used exclusively when MetricsServlet is requested for the JSON servlet handler (getHandlers).
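Outside of Spark, the same serialization can be reproduced directly with the Dropwizard Metrics and Jackson libraries (a sketch; the metrics-json and jackson-databind dependencies are assumed to be on the classpath):

```scala
import java.util.concurrent.TimeUnit

import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.json.MetricsModule
import com.fasterxml.jackson.databind.ObjectMapper

// Roughly what MetricsServlet does: an ObjectMapper with the Coda Hale MetricsModule
// registered, used to render a MetricRegistry as a JSON string.
val registry = new MetricRegistry()
registry.counter("demo.counter").inc()

val mapper = new ObjectMapper()
  .registerModule(new MetricsModule(TimeUnit.SECONDS, TimeUnit.MILLISECONDS, false))

val json: String = mapper.writeValueAsString(registry)
// e.g. {"version":"...","gauges":{},"counters":{"demo.counter":{"count":1}},...}
```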

=== Requesting JSON Servlet Handler -- getHandlers Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsServlet/#source-scala_1","title":"[source, scala]","text":""},{"location":"metrics/MetricsServlet/#gethandlersconf-sparkconf-arrayservletcontexthandler","title":"getHandlers(conf: SparkConf): Array[ServletContextHandler]","text":"

getHandlers returns just a single ServletContextHandler (in a collection) that gives the metrics snapshot in JSON format for every request at the servletPath URI path.

NOTE: getHandlers is used exclusively when MetricsSystem is requested for the metrics ServletContextHandlers (getServletHandlers).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsSystem/","title":"MetricsSystem","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsSystem is a registry of metrics sources and sinks of a Spark subsystem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"metrics/MetricsSystem/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MetricsSystem takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Instance Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SecurityManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                While being created, MetricsSystem requests the MetricsConfig to initialize.

MetricsSystem is created (using the createMetricsSystem utility) for the metrics instances (Spark subsystems).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"metrics/MetricsSystem/#prometheusservlet","title":"PrometheusServlet

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MetricsSystem creates a PrometheusServlet when requested to registerSinks for an instance with sink.prometheusServlet configuration.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MetricsSystem requests the PrometheusServlet for URL handlers when requested for servlet handlers (so it can be attached to a web UI and serve HTTP requests).
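
The following is a minimal sketch (not an official recipe) of enabling the PrometheusServlet sink through SparkConf properties with the spark.metrics.conf. prefix; the exact property keys and the /metrics/prometheus path are assumptions that may differ across Spark versions.

import org.apache.spark.{SparkConf, SparkContext}\n\n// Assumption: sink properties given with the spark.metrics.conf. prefix are picked up by MetricsConfig\nval conf = new SparkConf()\n  .setAppName(\"prometheus-servlet-demo\")\n  .setMaster(\"local[*]\")\n  .set(\"spark.metrics.conf.*.sink.prometheusServlet.class\",\n    \"org.apache.spark.metrics.sink.PrometheusServlet\")\n  .set(\"spark.metrics.conf.*.sink.prometheusServlet.path\",\n    \"/metrics/prometheus\")\nval sc = new SparkContext(conf) // the driver MetricsSystem registers the sink when started\nsc.stop()\n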

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#metricsservlet","title":"MetricsServlet

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                review me

MetricsServlet is a JSON metrics sink that is only available for the metrics instances with a web UI (i.e. the driver of a Spark application and Spark Standalone's Master).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MetricsSystem may have at most one MetricsServlet JSON metrics sink (which is registered by default).

Initialized when MetricsSystem registers metrics sinks (and finds a configuration entry with the servlet sink name).

Used when MetricsSystem is requested for servlet handlers.","text":""},{"location":"metrics/MetricsSystem/#creating-metricssystem","title":"Creating MetricsSystem

createMetricsSystem(\n  instance: String,\n  conf: SparkConf,\n  securityMgr: SecurityManager): MetricsSystem\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                createMetricsSystem creates a new MetricsSystem (for the given parameters).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                createMetricsSystem is used to create metrics systems.
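
For illustration only (MetricsSystem, createMetricsSystem and SecurityManager are Spark-internal APIs, so the sketch assumes code compiled against Spark internals), a metrics system is created for a named instance and then started:

import org.apache.spark.{SecurityManager, SparkConf}\nimport org.apache.spark.metrics.MetricsSystem\n\n// Illustrative use of the internal factory (similar to what SparkEnv does for the driver and executors)\nval conf = new SparkConf()\nval metricsSystem = MetricsSystem.createMetricsSystem(\"driver\", conf, new SecurityManager(conf))\nmetricsSystem.start()  // registers the configured metrics sources and sinks\nmetricsSystem.report() // requests the registered sinks to report metrics\nmetricsSystem.stop()\n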

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#metrics-sources-for-spark-sql","title":"Metrics Sources for Spark SQL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CodegenMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • HiveCatalogMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-source","title":"Registering Metrics Source
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSource(\n  source: Source): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSource adds source to the sources internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSource creates an identifier for the metrics source and registers it with the MetricRegistry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSource registers the metrics source under a given name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSource prints out the following INFO message to the logs when registering a name more than once:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Metrics already registered\n
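
Conceptually (a self-contained sketch with illustrative names rather than Spark's internal fields), the registration boils down to registering the source's own Dropwizard MetricRegistry under the built identifier in the shared registry:

import com.codahale.metrics.MetricRegistry\n\nval sharedRegistry = new MetricRegistry()        // the MetricsSystem-wide MetricRegistry\nval sourceRegistry = new MetricRegistry()        // a Source's own metricRegistry\nsourceRegistry.counter(\"records.read\")           // a metric owned by the source\nval regName = \"app-123.driver.mySource\"          // cf. buildRegistryName\nsharedRegistry.register(regName, sourceRegistry) // IllegalArgumentException if regName is taken\n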
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#building-metrics-source-identifier","title":"Building Metrics Source Identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                buildRegistryName(\n  source: Source): String\n

buildRegistryName uses the spark.metrics.namespace and spark.executor.id Spark properties to differentiate between a Spark application's driver and executors, and the other Spark framework components.

(only when the instance is driver or executor) buildRegistryName builds a metrics source name that is made up of spark.metrics.namespace, spark.executor.id and the name of the source.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                FIXME Finish for the other components.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                buildRegistryName is used when MetricsSystem is requested to register or remove a metrics source.
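
A simplified sketch of the resulting naming scheme for the driver and executors (the fallback of the namespace to the application ID is an assumption for illustration):

// Simplified; names and the application-ID fallback are assumptions for illustration\ndef registryName(namespace: String, executorId: String, sourceName: String): String =\n  Seq(namespace, executorId, sourceName).mkString(\".\")\n\nregistryName(\"app-20240217185125-0000\", \"driver\", \"DAGScheduler\")\n// app-20240217185125-0000.driver.DAGScheduler\n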

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-sources-for-spark-instance","title":"Registering Metrics Sources for Spark Instance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSources(): Unit\n

registerSources finds the metrics configuration for the instance.

NOTE: instance is defined when MetricsSystem is created.

registerSources finds the configuration of all the metrics sources for the subsystem (as described with the source. prefix).

For every metrics source, registerSources finds the class property, creates an instance, and in the end registers it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When registerSources fails, you should see the following ERROR message in the logs followed by the exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Source class [classPath] cannot be instantiated\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSources is used when MetricsSystem is requested to start.
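
For example (a sketch assuming that metrics properties given with the spark.metrics.conf. prefix are honored by MetricsConfig), an extra JvmSource can be declared for the driver instance so that registerSources instantiates and registers it:

import org.apache.spark.{SparkConf, SparkContext}\n\n// Assumed equivalent to driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource in metrics.properties\nval conf = new SparkConf()\n  .setAppName(\"metrics-sources-demo\")\n  .setMaster(\"local[*]\")\n  .set(\"spark.metrics.conf.driver.source.jvm.class\",\n    \"org.apache.spark.metrics.source.JvmSource\")\nval sc = new SparkContext(conf)\nsc.stop()\n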

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#servlet-handlers","title":"Servlet Handlers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getServletHandlers: Array[ServletContextHandler]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getServletHandlers requests the metricsServlet (if defined) and the prometheusServlet (if defined) for URL handlers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getServletHandlers requires that the MetricsSystem is running or throws an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Can only call getServletHandlers on a running MetricsSystem\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getServletHandlers is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is created (and attaches the URL handlers to the web UI)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Master (Spark Standalone) is requested to onStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Worker (Spark Standalone) is requested to onStart
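
As an illustration only (assuming the default MetricsServlet is registered and the driver web UI listens on port 4040, serving JSON metrics at /metrics/json/), the attached handlers can be exercised from the driver:

import org.apache.spark.{SparkConf, SparkContext}\nimport scala.io.Source\n\nval sc = new SparkContext(\n  new SparkConf().setAppName(\"metrics-servlet-demo\").setMaster(\"local[*]\"))\n// /metrics/json/ is assumed to be the default MetricsServlet path on the driver web UI (port 4040)\nval json = Source.fromURL(\"http://localhost:4040/metrics/json/\").mkString\nprintln(json.take(200))\nsc.stop()\n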
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#registering-metrics-sinks","title":"Registering Metrics Sinks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSinks(): Unit\n

registerSinks requests the MetricsConfig for the configuration of the instance.

registerSinks then requests the MetricsConfig for the configuration (subProperties) of all metrics sinks (i.e. configuration entries that match the ^sink\\\\.(.+)\\\\.(.+) regular expression).

For every metrics sink configuration, registerSinks takes the class property and (if defined) creates an instance of the metrics sink using a constructor that takes the configuration, MetricRegistry and SecurityManager.

For a single servlet metrics sink, registerSinks converts the sink to a MetricsServlet and sets the metricsServlet internal registry.

For all other metrics sinks, registerSinks adds the sink to the sinks internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In case of an Exception, registerSinks prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Sink class [classPath] cannot be instantiated\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerSinks is used when MetricsSystem is requested to start.
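
For example (again assuming the spark.metrics.conf. prefix is honored), a ConsoleSink can be declared for all instances so that registerSinks instantiates it:

import org.apache.spark.{SparkConf, SparkContext}\n\n// Assumed equivalent to *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink in metrics.properties\nval conf = new SparkConf()\n  .setAppName(\"metrics-sinks-demo\")\n  .setMaster(\"local[*]\")\n  .set(\"spark.metrics.conf.*.sink.console.class\",\n    \"org.apache.spark.metrics.sink.ConsoleSink\")\n  .set(\"spark.metrics.conf.*.sink.console.period\", \"10\")\n  .set(\"spark.metrics.conf.*.sink.console.unit\", \"seconds\")\nval sc = new SparkContext(conf)\nsc.stop()\n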

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#stopping","title":"Stopping
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                stop...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#reporting-metrics","title":"Reporting Metrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                report(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                report simply requests the registered metrics sinks to report metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#starting","title":"Starting
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                start(): Unit\n

start turns the running flag on.

NOTE: start can only be called once and throws an IllegalArgumentException when called multiple times.

start registers the metrics sources for Spark SQL, i.e. CodegenMetrics and HiveCatalogMetrics.

start then registers the configured metrics sources and sinks for the instance.

In the end, start requests the registered metrics sinks to start.

start throws an IllegalArgumentException when the running flag is already on:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                requirement failed: Attempting to start a MetricsSystem that is already running\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.metrics.MetricsSystem logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.metrics.MetricsSystem=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#internal-registries","title":"Internal Registries","text":""},{"location":"metrics/MetricsSystem/#metricregistry","title":"MetricRegistry

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Integration point to Dropwizard Metrics' MetricRegistry

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when MetricsSystem is requested to:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Register or remove a metrics source
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Start (that in turn registers metrics sinks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#metricsconfig","title":"MetricsConfig

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MetricsConfig

Initialized when MetricsSystem is created.

Used when MetricsSystem registers metrics sources and metrics sinks.","text":""},{"location":"metrics/MetricsSystem/#running-flag","title":"running Flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Indicates whether MetricsSystem has been started (true) or not (false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"metrics/MetricsSystem/#sinks","title":"sinks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Metrics sinks

Used when MetricsSystem registers a metrics sink and starts.","text":""},{"location":"metrics/MetricsSystem/#sources","title":"sources

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Metrics sources

Used when MetricsSystem registers a metrics source.","text":""},{"location":"metrics/PrometheusServlet/","title":"PrometheusServlet","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                PrometheusServlet is a metrics sink that comes with a ServletContextHandler to serve metrics snapshots in Prometheus format.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"metrics/PrometheusServlet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                PrometheusServlet takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MetricRegistry (Dropwizard Metrics)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  PrometheusServlet is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MetricsSystem is requested to register metric sinks (with sink.prometheusServlet configuration)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"metrics/PrometheusServlet/#servletcontexthandler","title":"ServletContextHandler

PrometheusServlet creates a ServletContextHandler to be registered at the path configured by the path property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The ServletContextHandler handles text/plain content type.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When executed, the ServletContextHandler gives a metrics snapshot.
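
As an illustrative sketch (not taken from the Spark sources), the sink can be wired up through the spark.metrics.conf. prefix on a SparkConf; the path value below is an example, not a built-in default:

import org.apache.spark.SparkConf\n\n// Sketch only: register PrometheusServlet as a metrics sink and pick a path\nval conf = new SparkConf()\n  .set(\"spark.metrics.conf.*.sink.prometheusServlet.class\",\n    \"org.apache.spark.metrics.sink.PrometheusServlet\")\n  .set(\"spark.metrics.conf.*.sink.prometheusServlet.path\",\n    \"/metrics/prometheus\")\n

The same two properties can equally be placed in the metrics configuration file (spark.metrics.conf).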

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/PrometheusServlet/#metrics-snapshot","title":"Metrics Snapshot
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getMetricsSnapshot(\n  request: HttpServletRequest): String\n

getMetricsSnapshot builds a text snapshot of the metrics in the MetricRegistry in Prometheus format.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/PrometheusServlet/#gethandlers","title":"getHandlers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getHandlers(\n  conf: SparkConf): Array[ServletContextHandler]\n

getHandlers returns the ServletContextHandler (in a single-element array).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getHandlers is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MetricsSystem is requested for servlet handlers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/Sink/","title":"Sink","text":"

Sink is an abstraction of metrics sinks.

package org.apache.spark.metrics.sink\n\ntrait Sink {\n  def start(): Unit\n  def stop(): Unit\n  def report(): Unit\n}\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: Sink is a private[spark] contract.
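
For illustration only, a minimal custom sink could look like the sketch below; because Sink is private[spark], the class is placed in an org.apache.spark package (a common workaround), and the (Properties, MetricRegistry) constructor is an assumption that may vary across Spark versions (some take an extra SecurityManager parameter):

// Hypothetical sink, not part of the Spark codebase\npackage org.apache.spark.metrics.sink\n\nimport java.util.Properties\nimport com.codahale.metrics.MetricRegistry\n\nclass StdoutSink(val property: Properties, val registry: MetricRegistry) extends Sink {\n  override def start(): Unit = println(\"StdoutSink started\")\n  override def stop(): Unit = println(\"StdoutSink stopped\")\n  override def report(): Unit =\n    println(s\"gauges=${registry.getGauges.size} counters=${registry.getCounters.size}\")\n}\n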

Sink Contract:

• start: Used when MetricsSystem is started
• stop: Used when MetricsSystem is stopped
• report: Used when MetricsSystem is requested to report metrics

Sinks:

• ConsoleSink
• CsvSink
• GraphiteSink
• JmxSink
• MetricsServlet
• Slf4jSink
• StatsdSink

NOTE: All known Sinks in Spark 2.3 are in the org.apache.spark.metrics.sink Scala package."},{"location":"metrics/Source/","title":"Source","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Source is an abstraction of metrics sources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"metrics/Source/#contract","title":"Contract","text":""},{"location":"metrics/Source/#metricregistry","title":"MetricRegistry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  metricRegistry: MetricRegistry\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MetricRegistry (Codahale Metrics)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MetricsSystem is requested to register a metrics source
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/Source/#source-name","title":"Source Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sourceName: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

• MetricsSystem is requested to build a metrics source identifier and to look up sources by name (getSourcesByName)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/Source/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AccumulatorSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManagerSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DAGSchedulerSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorAllocationManagerSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorMetricsSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExecutorSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • JvmSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleMetricsSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • others
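
The sketch below (a hypothetical QueueDepthSource, not one of the implementations above) shows the two contract members; as with Sink, Source is assumed here to be private[spark], so the class sits in an org.apache.spark package. Once created, such a source can be registered with MetricsSystem.registerSource.

// Hypothetical metrics source, not part of the Spark codebase\npackage org.apache.spark.metrics.source\n\nimport com.codahale.metrics.{Gauge, MetricRegistry}\n\nclass QueueDepthSource(queueDepth: () => Int) extends Source {\n  override val sourceName: String = \"queueDepth\"\n  override val metricRegistry: MetricRegistry = new MetricRegistry\n  metricRegistry.register(MetricRegistry.name(\"depth\"), new Gauge[Int] {\n    override def getValue: Int = queueDepth()\n  })\n}\n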
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"metrics/configuration-properties/","title":"Configuration Properties","text":""},{"location":"metrics/configuration-properties/#sparkmetricsappstatussourceenabled","title":"spark.metrics.appStatusSource.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enables Dropwizard/Codahale metrics with the status of a live Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusSource utility is used to create an AppStatusSource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsconf","title":"spark.metrics.conf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The metrics configuration file

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: metrics.properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsexecutormetricssourceenabled","title":"spark.metrics.executorMetricsSource.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enables registering ExecutorMetricsSource with the metrics system

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsnamespace","title":"spark.metrics.namespace

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Root namespace for metrics reporting

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: Spark Application ID (i.e. spark.app.id configuration property)

Since a Spark application's ID changes with every execution, a custom namespace can be specified for easier metrics reporting.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when MetricsSystem is requested for a metrics source identifier (metrics namespace)
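
For example (the namespace value below is arbitrary), a stable namespace can be set on the SparkConf so that metric names do not change between runs:

import org.apache.spark.SparkConf\n\n// Sketch: replace the per-run application ID with a fixed namespace\nval conf = new SparkConf()\n  .set(\"spark.metrics.namespace\", \"my_reporting_app\")\n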

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"metrics/configuration-properties/#sparkmetricsstaticsourcesenabled","title":"spark.metrics.staticSources.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enables static metric sources

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkEnv utility is used to create SparkEnv for executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"network/","title":"Network","text":""},{"location":"network/SparkTransportConf/","title":"SparkTransportConf Utility","text":""},{"location":"network/SparkTransportConf/#fromsparkconf","title":"fromSparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  fromSparkConf(\n  _conf: SparkConf,\n  module: String, // (1)\n  numUsableCores: Int = 0,\n  role: Option[String] = None): TransportConf // (2)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. The given module is shuffle most of the time except:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • rpc for NettyRpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • files for NettyRpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2. Only defined in NettyRpcEnv to be either driver or executor

fromSparkConf makes a copy of (clones) the given SparkConf.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  fromSparkConf sets the following configuration properties (for the given module):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.[module].io.serverThreads
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.[module].io.clientThreads

The values are resolved using the following properties, in order, until one is found (with [suffix] being serverThreads or clientThreads, respectively):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. spark.[role].[module].io.[suffix]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2. spark.[module].io.[suffix]

If neither is found, fromSparkConf falls back to a default number of threads (based on the given numUsableCores, but no more than 8).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, fromSparkConf creates a TransportConf (for the given module and the updated SparkConf).
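The resolution order above can be pictured with a short Scala sketch. This is a hedged illustration, not Spark's actual implementation; resolveNumThreads is a hypothetical helper and the cap of 8 threads follows the description above.

import org.apache.spark.SparkConf

// Hypothetical helper illustrating the property resolution described above.
// suffix is either "serverThreads" or "clientThreads".
def resolveNumThreads(
    conf: SparkConf,
    module: String,
    role: Option[String],
    suffix: String,
    numUsableCores: Int): Int = {
  // 1. spark.[role].[module].io.[suffix], then 2. spark.[module].io.[suffix]
  val keys = role.map(r => s"spark.$r.$module.io.$suffix").toSeq :+ s"spark.$module.io.$suffix"
  val configured = keys.flatMap(conf.getOption).headOption.map(_.toInt)
  // Fall back to the given numUsableCores (or all available cores), capped at 8.
  val cores =
    if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
  configured.getOrElse(math.min(cores, 8))
}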

fromSparkConf is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkEnv utility is used to create a SparkEnv (with the spark.shuffle.service.enabled configuration property enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ExternalShuffleService is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • NettyBlockTransferService is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • NettyRpcEnv is created and requested for a downloadClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • IndexShuffleBlockResolver is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleBlockPusher is requested to initiateBlockPush
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"network/TransportClientFactory/","title":"TransportClientFactory","text":""},{"location":"network/TransportClientFactory/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TransportClientFactory takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TransportContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TransportClientBootstraps

TransportClientFactory is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TransportContext is requested for a TransportClientFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"network/TransportClientFactory/#configuration-properties","title":"Configuration Properties","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    While being created, TransportClientFactory requests the given TransportContext for the TransportConf that is used to access the values of the following (configuration) properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • io.numConnectionsPerPeer
• io.mode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • io.preferDirectBufs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • io.retryWait
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.network.sharedByteBufAllocators.enabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.network.io.preferDirectBufs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Module Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"network/TransportClientFactory/#creating-transportclient","title":"Creating TransportClient
TransportClient createClient(
  String remoteHost,
  int remotePort) // (1)
TransportClient createClient(
  String remoteHost,
  int remotePort,
  boolean fastFail)
TransportClient createClient(
  InetSocketAddress address)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1. Turns fastFail off

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createClient prints out the following DEBUG message to the logs:

Creating new connection to [address]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createClient creates a Netty Bootstrap and initializes it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createClient requests the Netty Bootstrap to connect.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If successful, createClient prints out the following DEBUG message and requests the TransportClientBootstraps to doBootstrap.

Connection to [address] successful, running bootstraps...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, createClient prints out the following INFO message:

Successfully created connection to [address] after [t] ms ([t] ms spent in bootstraps)
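For reference, a hedged usage sketch (in Scala) of going from a TransportContext to a connected TransportClient; transportContext, host and port are placeholder values, not part of the original description.

import org.apache.spark.network.TransportContext
import org.apache.spark.network.client.{TransportClient, TransportClientFactory}

// Placeholder sketch: obtain a client factory and connect to a remote endpoint.
def connect(transportContext: TransportContext, host: String, port: Int): TransportClient = {
  val clientFactory: TransportClientFactory = transportContext.createClientFactory()
  // Triggers the "Creating new connection to ..." and, on success,
  // "Successfully created connection to ..." messages described above.
  clientFactory.createClient(host, port)
}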
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"network/TransportConf/","title":"TransportConf","text":""},{"location":"network/TransportConf/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TransportConf takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Module Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ConfigProvider

TransportConf is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkTransportConf utility is used to fromSparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • YarnShuffleService (Spark on YARN) is requested to serviceInit
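In practice, Spark code rarely constructs a TransportConf directly; it typically goes through the SparkTransportConf.fromSparkConf utility described earlier. A hedged sketch (the module name and core count are example values):

import org.apache.spark.SparkConf
import org.apache.spark.network.netty.SparkTransportConf
import org.apache.spark.network.util.TransportConf

// Example values; any supported module name ("shuffle", "rpc", "files") would do.
val sparkConf = new SparkConf()
val transportConf: TransportConf =
  SparkTransportConf.fromSparkConf(sparkConf, module = "shuffle", numUsableCores = 4)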
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"network/TransportConf/#module-name","title":"Module Name

TransportConf is given the name of the module that the transport-related configuration properties belong to. Per SparkTransportConf, the module name is one of the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • rpc for NettyRpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • files for NettyRpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#getmodulename","title":"getModuleName
String getModuleName()

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getModuleName returns the module name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#getconfkey","title":"getConfKey
String getConfKey(
  String suffix)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getConfKey creates the key of a configuration property (with the module and the given suffix):

spark.[module].[suffix]
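For example, for a TransportConf created for the shuffle module, the keys resolve as follows. This is a hedged illustration that merely mirrors the key layout; confKey is a hypothetical helper, as getConfKey itself is internal to TransportConf.

// Mirrors the spark.[module].[suffix] layout.
def confKey(module: String, suffix: String): String = s"spark.$module.$suffix"

confKey("shuffle", "io.mode")      // spark.shuffle.io.mode
confKey("shuffle", "io.retryWait") // spark.shuffle.io.retryWait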
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#suffixes","title":"Suffixes","text":""},{"location":"network/TransportConf/#iomode","title":"io.mode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • nio (default)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • epoll
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#iopreferdirectbufs","title":"io.preferDirectBufs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Controls whether Spark prefers allocating off-heap byte buffers within Netty (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#ioconnectiontimeout","title":"io.connectionTimeout","text":""},{"location":"network/TransportConf/#ioconnectioncreationtimeout","title":"io.connectionCreationTimeout","text":""},{"location":"network/TransportConf/#iobacklog","title":"io.backLog

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The requested maximum length of the queue of incoming connections

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: -1 (no backlog)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#ionumconnectionsperpeer","title":"io.numConnectionsPerPeer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: 1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#ioserverthreads","title":"io.serverThreads","text":""},{"location":"network/TransportConf/#ioclientthreads","title":"io.clientThreads

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: 0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#ioreceivebuffer","title":"io.receiveBuffer","text":""},{"location":"network/TransportConf/#iosendbuffer","title":"io.sendBuffer","text":""},{"location":"network/TransportConf/#sasltimeout","title":"sasl.timeout","text":""},{"location":"network/TransportConf/#iomaxretries","title":"io.maxRetries","text":""},{"location":"network/TransportConf/#ioretrywait","title":"io.retryWait

Time to wait before retrying after an IOException. Only relevant when io.maxRetries is greater than 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: 5s

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#iolazyfd","title":"io.lazyFD","text":""},{"location":"network/TransportConf/#ioenableverbosemetrics","title":"io.enableVerboseMetrics

Enables Netty's detailed memory metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#ioenabletcpkeepalive","title":"io.enableTcpKeepAlive","text":""},{"location":"network/TransportConf/#preferdirectbufsforsharedbytebufallocators","title":"preferDirectBufsForSharedByteBufAllocators

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The value of spark.network.io.preferDirectBufs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportConf/#sharedbytebufallocators","title":"sharedByteBufAllocators

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The value of spark.network.sharedByteBufAllocators.enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"network/TransportContext/","title":"TransportContext","text":""},{"location":"network/TransportContext/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TransportContext takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RpcHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • closeIdleConnections flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • isClientOnly flag

TransportContext is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalBlockStoreClient is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalShuffleService is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransferService is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyRpcEnv is created and requested to downloadClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • YarnShuffleService (Spark on YARN) is requested to serviceInit
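Putting the constructor arguments together, the following is a minimal, hypothetical sketch (not taken from the Spark sources) of creating a client-only TransportContext. It assumes a TransportConf built with SparkTransportConf.fromSparkConf and the NoOpRpcHandler from the network-common module are available in the Spark version at hand.

import org.apache.spark.SparkConf
import org.apache.spark.network.TransportContext
import org.apache.spark.network.netty.SparkTransportConf
import org.apache.spark.network.server.NoOpRpcHandler

// Build a TransportConf for the "shuffle" module out of a SparkConf
val transportConf = SparkTransportConf.fromSparkConf(new SparkConf(), "shuffle")

// Client-only TransportContext; a no-op RpcHandler suffices for this sketch
val context = new TransportContext(
  transportConf,
  new NoOpRpcHandler(), // RpcHandler
  true,                 // closeIdleConnections
  true)                 // isClientOnly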
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"network/TransportContext/#creating-server","title":"Creating Server
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TransportServer createServer(\n  int port,\n  List<TransportServerBootstrap> bootstraps)\nTransportServer createServer(\n  String host,\n  int port,\n  List<TransportServerBootstrap> bootstraps)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createServer creates a TransportServer (with the RpcHandler and the input arguments).

createServer is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • YarnShuffleService (Spark on YARN) is requested to serviceInit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalShuffleService is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransferService is requested to createServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyRpcEnv is requested to startServer
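Continuing the hypothetical sketch above, a TransportServer could be started on an ephemeral port with no server bootstraps:

import java.util.Collections
import org.apache.spark.network.server.{TransportServer, TransportServerBootstrap}

// Port 0 lets the OS pick an ephemeral port
val server: TransportServer = context.createServer(
  "localhost",
  0,
  Collections.emptyList[TransportServerBootstrap]())
println(s"TransportServer listening on port ${server.getPort}")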
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"network/TransportContext/#creating-transportclientfactory","title":"Creating TransportClientFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TransportClientFactory createClientFactory() // (1)\nTransportClientFactory createClientFactory(\n  List<TransportClientBootstrap> bootstraps)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. Uses empty bootstraps

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createClientFactory creates a TransportClientFactory (with itself and the given TransportClientBootstraps).

createClientFactory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalBlockStoreClient is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransferService is requested to init
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyRpcEnv is created and requested to downloadClient
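Continuing the same hypothetical sketch, the TransportClientFactory can then open (or reuse) a connection to the server started earlier:

import org.apache.spark.network.client.{TransportClient, TransportClientFactory}

// No-argument variant, i.e. empty client bootstraps
val clientFactory: TransportClientFactory = context.createClientFactory()

// Open (or reuse) a connection to the server
val client: TransportClient = clientFactory.createClient("localhost", server.getPort)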
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"plugins/","title":"Plugin Framework","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Plugin Framework is an API for registering custom extensions (plugins) to be executed on the driver and executors.

Plugin Framework uses separate PluginContainers for the driver and executors, and the spark.plugins configuration property to specify the SparkPlugins to be registered.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Plugin Framework was introduced in Spark 2.4.4 (with an API for executors) with further changes in Spark 3.0.0 (to cover the driver).
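For illustration, a custom SparkPlugin (the class name below is made up for this sketch; the org.apache.spark.api.plugin interfaces are the public Plugin Framework API) could be registered with spark.plugins=com.example.MyPlugin:

import java.util.{Collections, Map => JMap}
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class MyPlugin extends SparkPlugin {

  // Driver-side component (initialized at SparkContext startup)
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, pluginContext: PluginContext): JMap[String, String] =
      Collections.emptyMap() // extra configuration propagated to the executor-side plugin
  }

  // Executor-side component (initialized on every executor)
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = ()
  }
}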

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"plugins/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Advanced Instrumentation in the official documentation of Apache Spark
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Commit for SPARK-29397
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark Plugin Framework in 3.0 - Part 1: Introduction by Madhukara Phatak
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark Memory Monitor by squito
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkPlugins by Luca Canali (CERN)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"plugins/DriverPlugin/","title":"DriverPlugin","text":"

DriverPlugin is the driver-side component of a SparkPlugin.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"plugins/DriverPluginContainer/","title":"DriverPluginContainer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        DriverPluginContainer is a PluginContainer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"plugins/DriverPluginContainer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        DriverPluginContainer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Resources (Map[String, ResourceInformation])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkPlugins

DriverPluginContainer is created when:

• PluginContainer utility is used to create a PluginContainer (at SparkContext startup)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"plugins/DriverPluginContainer/#registering-metrics","title":"Registering Metrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          registerMetrics(\n  appId: String): Unit\n

registerMetrics is part of the PluginContainer abstraction.

For every driver plugin, registerMetrics requests the plugin to register its metrics (with the given appId) and then requests the associated PluginContextImpl to register them with the metrics system.
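For illustration (the class name is hypothetical; metricRegistry is part of the public PluginContext API), a DriverPlugin could expose a Dropwizard metric that this registerMetrics step picks up:

import com.codahale.metrics.Counter
import org.apache.spark.api.plugin.{DriverPlugin, PluginContext}

class CountingDriverPlugin extends DriverPlugin {
  private val requests = new Counter()

  // Invoked by DriverPluginContainer.registerMetrics with the application id
  override def registerMetrics(appId: String, pluginContext: PluginContext): Unit = {
    pluginContext.metricRegistry().register("requests", requests)
  }
}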

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"plugins/DriverPluginContainer/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.internal.plugin.DriverPluginContainer logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          log4j.logger.org.apache.spark.internal.plugin.DriverPluginContainer=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"plugins/ExecutorPlugin/","title":"ExecutorPlugin","text":"

ExecutorPlugin is the executor-side component of a SparkPlugin.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"plugins/ExecutorPluginContainer/","title":"ExecutorPluginContainer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExecutorPluginContainer is a PluginContainer for Executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"plugins/ExecutorPluginContainer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExecutorPluginContainer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Resources (Map[String, ResourceInformation])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkPlugins

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorPluginContainer is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PluginContainer utility is used to create a PluginContainer (for Executors)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"plugins/ExecutorPluginContainer/#executorplugins","title":"ExecutorPlugins

ExecutorPluginContainer initializes the executorPlugins internal registry of ExecutorPlugins when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/ExecutorPluginContainer/#initialization","title":"Initialization","text":"

executorPlugins finds all the configuration properties with the spark.plugins.internal.conf. prefix (in the SparkConf) that provide extra configuration for the ExecutorPlugins of the given SparkPlugins.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For every SparkPlugin (in the given SparkPlugins) that defines an ExecutorPlugin, executorPlugins creates a PluginContextImpl, requests the ExecutorPlugin to init (with the PluginContextImpl and the extra configuration) and the PluginContextImpl to registerMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, executorPlugins prints out the following INFO message to the logs (for every ExecutorPlugin):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Initialized executor component for plugin [name].\n
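As a rough sketch (the class name is made up), an ExecutorPlugin observes that extra configuration when its init is invoked during this initialization:

import java.util.{Map => JMap}
import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}

class VerboseExecutorPlugin extends ExecutorPlugin {
  // extraConf carries the extra configuration described above
  override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
    println(s"Initialized with ${extraConf.size()} extra configuration properties")
  }
}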
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"plugins/ExecutorPluginContainer/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.internal.plugin.ExecutorPluginContainer logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.internal.plugin.ExecutorPluginContainer=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/","title":"PluginContainer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PluginContainer is an abstraction of plugin containers that can register metrics (for the driver and executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PluginContainer is created for the driver and executors using apply utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"plugins/PluginContainer/#contract","title":"Contract","text":""},{"location":"plugins/PluginContainer/#listening-to-task-failures","title":"Listening to Task Failures
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            onTaskFailed(\n  failureReason: TaskFailedReason): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For ExecutorPluginContainer only

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Possible TaskFailedReasons:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskKilledException
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskKilled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • FetchFailed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskCommitDenied
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExceptionFailure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to run (and the task has failed)
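On the plugin side, these notifications surface as the optional task-lifecycle callbacks of ExecutorPlugin (available since Spark 3.1). A hypothetical plugin counting task failures:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.TaskFailedReason
import org.apache.spark.api.plugin.ExecutorPlugin

class TaskAuditingPlugin extends ExecutorPlugin {
  private val failedTasks = new AtomicLong(0)

  // Expected to be invoked via ExecutorPluginContainer.onTaskFailed when a task fails
  override def onTaskFailed(failureReason: TaskFailedReason): Unit = {
    failedTasks.incrementAndGet()
  }
}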
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/#listening-to-task-start","title":"Listening to Task Start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            onTaskStart(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For ExecutorPluginContainer only

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to run (and the task has just started)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/#listening-to-task-success","title":"Listening to Task Success
onTaskSucceeded(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For ExecutorPluginContainer only

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskRunner is requested to run (and the task has finished successfully)
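
The three task-listener callbacks above end up on the executor-side plugins. As a rough, hedged sketch only (the class name is made up; the callbacks shown are part of the ExecutorPlugin API), a plugin could log task lifecycle events like this:

import org.apache.spark.TaskFailedReason
import org.apache.spark.api.plugin.ExecutorPlugin

// Hypothetical executor-side plugin that only logs task lifecycle events
class TaskEventLoggingPlugin extends ExecutorPlugin {
  override def onTaskStart(): Unit =
    println("Task started")

  override def onTaskSucceeded(): Unit =
    println("Task succeeded")

  override def onTaskFailed(failureReason: TaskFailedReason): Unit =
    println(s"Task failed: ${failureReason.toErrorString}")
}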
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/#registering-metrics","title":"Registering Metrics
registerMetrics(
  appId: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Registers metrics for the application ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For DriverPluginContainer only

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is created
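
Registering metrics is presumably forwarded to the driver-side plugins along with their PluginContext. As a hedged sketch (the plugin and metric names are made up), a DriverPlugin could register a Dropwizard metric as follows:

import com.codahale.metrics.Counter
import org.apache.spark.api.plugin.{DriverPlugin, PluginContext}

// Hypothetical driver-side plugin that registers a single Dropwizard counter
class CountingDriverPlugin extends DriverPlugin {
  private val jobsSeen = new Counter()

  override def registerMetrics(appId: String, pluginContext: PluginContext): Unit = {
    // Register the counter with the plugin's metric registry
    pluginContext.metricRegistry().register("jobsSeen", jobsSeen)
  }
}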
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/#shutdown","title":"Shutdown
shutdown(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContainer/#implementations","title":"Implementations","text":"Sealed Abstract Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PluginContainer is a Scala sealed abstract class which means that all of the implementations are in the same compilation unit (a single file).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DriverPluginContainer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorPluginContainer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"plugins/PluginContainer/#creating-plugincontainer","title":"Creating PluginContainer
// the driver
apply(
  sc: SparkContext,
  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]

// executors
apply(
  env: SparkEnv,
  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]

// private helper
apply(
  ctx: Either[SparkContext, SparkEnv],
  resources: java.util.Map[String, ResourceInformation]): Option[PluginContainer]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            apply creates a PluginContainer for the driver or executors (based on the type of the first input argument, i.e. SparkContext or SparkEnv, respectively).

apply first loads the SparkPlugins defined by the spark.plugins configuration property.

Only when at least one plugin has been loaded does apply create a DriverPluginContainer or an ExecutorPluginContainer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            apply is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor is created
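
For apply to create a container at all, spark.plugins has to name at least one SparkPlugin implementation. A minimal configuration sketch (org.example.MyPlugin stands for any SparkPlugin implementation available on the classpath):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical plugin class registered via spark.plugins
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("plugin-demo")
  .set("spark.plugins", "org.example.MyPlugin")

// Creating a SparkContext is one of the moments when apply is used (driver side)
val sc = new SparkContext(conf)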
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"plugins/PluginContextImpl/","title":"PluginContextImpl","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PluginContextImpl is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"plugins/SparkPlugin/","title":"SparkPlugin","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkPlugin is an abstraction of custom extensions for Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#driver-side-component","title":"Driver-side Component
DriverPlugin driverPlugin()

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DriverPluginContainer is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"plugins/SparkPlugin/#executor-side-component","title":"Executor-side Component
ExecutorPlugin executorPlugin()

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorPluginContainer is created
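
Putting the two sides of the contract together, a no-op SparkPlugin could look like the following sketch (the class name is made up; DriverPlugin and ExecutorPlugin provide default implementations for their callbacks, so empty anonymous instances compile):

import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// Hypothetical SparkPlugin wiring a driver-side and an executor-side component
class MyPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {}
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {}
}

Such a class would then be registered using the spark.plugins configuration property shown earlier.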
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":"","tags":["DeveloperApi"]},{"location":"rdd/","title":"Resilient Distributed Dataset (RDD)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resilient Distributed Dataset (aka RDD) is the primary data abstraction in Apache Spark and the core of Spark (that I often refer to as Spark Core).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The origins of RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The original paper that gave birth to the concept of RDD is Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

Read the paper and skip the rest of this page. You'll save a great deal of your precious time 😎

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            An RDD is a description of a fault-tolerant and resilient computation over a distributed collection of records (spread over one or many partitions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RDDs and Scala Collections

RDDs are like Scala collections; they differ only in their distribution, i.e. an RDD is computed on many JVMs while a Scala collection lives on a single JVM.

Using RDDs, Spark hides data partitioning and distribution, which in turn allowed its designers to build a parallel computation framework with a higher-level programming interface (API) for four mainstream programming languages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The features of RDDs (decomposing the name):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Distributed with data residing on multiple nodes in a Spark cluster
• Dataset is a collection of partitioned data with primitive values or compound values (e.g. tuples or other objects) that represent the records of the data you work with.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            From the scaladoc of org.apache.spark.rdd.RDD:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            From the original paper about RDD - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

Besides the above traits (that are directly embedded in the name of the data abstraction - RDD), an RDD has the following additional traits:

• In-Memory, i.e. the data inside an RDD is stored in memory as much (size) and as long (time) as possible.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Immutable or Read-Only, i.e. it does not change once created and can only be transformed using transformations to new RDDs.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action is executed that triggers the execution.
• Cacheable, i.e. you can hold all the data in a persistent "storage" like memory (default and the most preferred) or disk (the least preferred due to access speed).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Parallel, i.e. process data in parallel.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Typed -- RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)].
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Partitioned -- records are partitioned (split into logical partitions) and distributed across nodes in a cluster.
• Location-Stickiness -- an RDD can define preferred locations to compute partitions (as close to the records as possible).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

Preferred location (aka locality preferences or placement preferences or locality info) is information about the locations of RDD records that Spark's DAGScheduler uses to place the tasks computing partitions as close to the data as possible.

Computing partitions of an RDD is a distributed process by design. To achieve even data distribution and to leverage data locality (in distributed systems like HDFS or Apache Kafka, where data is partitioned by default), the data is split into a fixed number of partitions - logical chunks (parts) of data. The logical division is for processing only; internally the data is not divided whatsoever. Each partition comprises records.

Partitions are the units of parallelism. You can control the number of partitions of an RDD using the RDD.repartition or RDD.coalesce transformations. Spark tries to stay as close to the data as possible without wasting time sending data across the network (by means of RDD shuffling), and creates as many partitions as required to follow the storage layout and thus optimize data access. This leads to a one-to-one mapping between (physical) data in distributed data storage (e.g., HDFS or Cassandra) and partitions.
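
A quick spark-shell illustration of controlling the number of partitions (the partition counts are only indicative):

val rdd = sc.parallelize(1 to 100, numSlices = 8)
rdd.getNumPartitions            // 8

val fewer = rdd.coalesce(4)     // merges partitions; no shuffle by default
val more  = rdd.repartition(16) // redistributes the data with a shuffle

fewer.getNumPartitions          // 4
more.getNumPartitions           // 16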

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDs support two kinds of operations:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • transformations - lazy operations that return another RDD.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • actions - operations that trigger computation and return values.
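
For example (in spark-shell), map below is a lazy transformation and reduce is the action that actually triggers the computation:

val nums = sc.parallelize(1 to 10)

val doubled = nums.map(_ * 2)       // transformation: returns another RDD, nothing is computed yet
val total   = doubled.reduce(_ + _) // action: triggers a job and returns a value (110)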

The motivation to create RDDs was (according to the authors) two types of applications that the computing frameworks of the time handled inefficiently:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • iterative algorithms in machine learning and graph computations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • interactive data mining tools as ad-hoc queries on the same dataset

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The goal is to reuse intermediate in-memory results across multiple data-intensive workloads with no need for copying large amounts of data over the network.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Technically, RDDs follow the contract defined by the five main intrinsic properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Parent RDDs (aka RDD dependencies)
• An array of partitions that a dataset is divided into
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • A compute function to do a computation on partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • An optional Partitioner that defines how keys are hashed, and the pairs partitioned (for key-value RDDs)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Optional preferred locations (aka locality info), i.e. hosts for a partition where the records live or are the closest to read from
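
The five properties above map onto members of the RDD contract (getPartitions, compute, the dependencies passed to the constructor, partitioner, getPreferredLocations). A minimal, illustrative custom RDD that implements the required ones could look like the following sketch (the class and its behaviour are made up):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partitions must be serializable and carry their index
case class RangePartition(index: Int) extends Partition

// Hypothetical RDD producing `perPartition` consecutive Ints per partition
class SimpleRangeRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs (dependencies)

  // The array of partitions the dataset is divided into
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numPartitions)(i => RangePartition(i))

  // The compute function run for every partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * perPartition
    (start until start + perPartition).iterator
  }

  // Optional: preferred locations for a partition (none in this sketch)
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

With such a class on the classpath, new SimpleRangeRDD(sc, 4, 25).collect() would yield the numbers 0 to 99.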

This RDD abstraction supports an expressive set of operations without having to modify the scheduler for each one.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              An RDD is a named (by name) and uniquely identified (by id) entity in a SparkContext (available as context property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDs live in one and only one SparkContext that creates a logical boundary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDs cannot be shared between SparkContexts.

An RDD can optionally have a friendly name that is accessible using name and can be changed using the name setter (name = ...):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> val ns = sc.parallelize(0 to 10)\nns: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24\n\nscala> ns.id\nres0: Int = 2\n\nscala> ns.name\nres1: String = null\n\nscala> ns.name = \"Friendly name\"\nns.name: String = Friendly name\n\nscala> ns.name\nres2: String = Friendly name\n\nscala> ns.toDebugString\nres3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:24 []\n

RDDs are containers of instructions on how to materialize big (arrays of) distributed data and how to split it into partitions, so that Spark (using executors) can hold some of them.

In general, distributing data helps execute processing in parallel, so that a task processes a chunk of data that it could eventually keep in memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in parallel. Inside a partition, data is processed sequentially.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Saving partitions results in part-files instead of one single file (unless there is a single partition).
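
For example (the output directory name below is just an illustrative choice), the number of partitions is available as getNumPartitions, and saving the RDD writes one part-file per partition:

val nums = sc.parallelize(1 to 100, 4)  // 4 partitions\nnums.getNumPartitions                   // Int = 4\nnums.saveAsTextFile(\"nums-output\")      // writes part-00000 through part-00003\n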

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#transformations","title":"Transformations","text":"

A transformation is a lazy operation on an RDD that returns another RDD (e.g. map, flatMap, filter, reduceByKey, join, cogroup).
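
A quick illustrative example of this laziness: the transformations below only build new RDDs and run no job until an action is called.

val words = sc.parallelize(Seq(\"ant\", \"bee\", \"cat\"))\nval upper = words.map(_.toUpperCase)    // lazy: returns a new RDD, no job is run\nval short = upper.filter(_.length < 4)  // still lazy: just another RDD\n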

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in Transformations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#actions","title":"Actions","text":"

An action is an operation that triggers execution of RDD transformations and returns a value (to the Spark driver, i.e. the user program).
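
For example (an illustrative snippet), count and collect are actions that trigger computation of a lazily-defined RDD:

val short = sc.parallelize(Seq(\"ant\", \"bee\", \"cat\")).map(_.toUpperCase)  // lazy\nshort.count()    // action: runs a job and returns 3\nshort.collect()  // action: returns Array(\"ANT\", \"BEE\", \"CAT\") to the driver\n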

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in Actions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#creating-rdds","title":"Creating RDDs","text":""},{"location":"rdd/#parallelize","title":"SparkContext.parallelize","text":"

One way to create an RDD is with the SparkContext.parallelize method. It accepts a collection of elements, as shown below (sc is a SparkContext instance):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> val rdd = sc.parallelize(1 to 1000)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              You may also want to randomize the sample data:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> val data = Seq.fill(10)(util.Random.nextInt)\ndata: Seq[Int] = List(-964985204, 1662791, -1820544313, -383666422, -111039198, 310967683, 1114081267, 1244509086, 1797452433, 124035586)\n\nscala> val rdd = sc.parallelize(data)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29\n

Given that the reason to use Spark is to process more data than your own laptop could handle, SparkContext.parallelize is mainly used to learn Spark in the Spark shell.

SparkContext.parallelize requires all the data to be available on a single machine - the Spark driver - which eventually hits the limits of your laptop.
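
Note that SparkContext.parallelize also takes the number of partitions (numSlices) as an optional second argument (defaulting to SparkContext.defaultParallelism), for example:

val rdd = sc.parallelize(1 to 1000, 8)  // 8 partitions\nrdd.getNumPartitions                    // Int = 8\n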

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#makeRDD","title":"SparkContext.makeRDD","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> sc.makeRDD(0 to 1000)\nres0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:25\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#textFile","title":"SparkContext.textFile","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              One of the easiest ways to create an RDD is to use SparkContext.textFile to read files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              You can use the local README.md file (and then flatMap over the lines inside to have an RDD of words):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> val words = sc.textFile(\"README.md\").flatMap(_.split(\"\\\\W+\")).cache\nwords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:24\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              You cache it so the computation is not performed every time you work with words.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/#rdds-in-web-ui","title":"RDDs in Web UI","text":"

It is quite informative to look at RDDs in the web UI, which is available at http://localhost:4040 for the Spark shell.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Execute the following Spark application (type all the lines in spark-shell):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              val ints = sc.parallelize(1 to 100) // (1)!\nints.setName(\"Hundred ints\")        // (2)!\nints.cache                          // (3)!\nints.count                          // (4)!\n
1. Creates an RDD with a hundred numbers (with as many partitions as possible)
2. Sets the name of the RDD
3. Caches the RDD for performance reasons, which also makes it visible in the Storage tab of the web UI
4. Executes an action (and materializes the RDD)

With the above executed, you should see the RDD in the web UI (e.g. in the Storage tab):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Click the name of the RDD (under RDD Name) and you will get the details of how the RDD is cached.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Execute the following Spark job and you will see how the number of partitions decreases.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ints.repartition(2).count\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/Aggregator/","title":"Aggregator","text":"

Aggregator is a set of aggregation functions used to aggregate data using the PairRDDFunctions.combineByKeyWithClassTag transformation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Aggregator[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

Aggregator transforms an RDD[(K, V)] into an RDD[(K, C)] (for a \"combined type\" C) using the following functions (see the example after this list):

• createCombiner: V => C
• mergeValue: (C, V) => C
• mergeCombiners: (C, C) => C
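
As an illustration (not the internal code path itself), the same three functions appear in PairRDDFunctions.combineByKey, e.g. to compute a per-key average:

val scores = sc.parallelize(Seq((\"a\", 1.0), (\"a\", 3.0), (\"b\", 4.0)))\nval avg = scores.combineByKey(\n    (v: Double) => (v, 1),                                  // createCombiner: V => C\n    (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),  // mergeValue: (C, V) => C\n    (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2)  // mergeCombiners: (C, C) => C\n  ).mapValues { case (sum, n) => sum / n }\navg.collect()  // e.g. Array((a,2.0), (b,4.0))\n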

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Aggregator is used to create a ShuffleDependency and ExternalSorter.

combineValuesByKey Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/Aggregator/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              combineValuesByKey( iter: Iterator[_ <: Product2[K, V]], context: TaskContext): Iterator[(K, C)]

combineValuesByKey creates a new ExternalAppendOnlyMap (with the aggregation functions).

combineValuesByKey requests the ExternalAppendOnlyMap to insert all key-value pairs from the given iterator (that is, the values of a partition).

combineValuesByKey updates the task metrics.

In the end, combineValuesByKey requests the ExternalAppendOnlyMap for an iterator of \"combined\" pairs.
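
Conceptually (a simplified, purely in-memory sketch; the real implementation uses a spillable ExternalAppendOnlyMap and also updates task metrics), combining values by key boils down to:

import scala.collection.mutable\n\ndef combineValuesByKeySketch[K, V, C](\n    iter: Iterator[(K, V)],\n    createCombiner: V => C,\n    mergeValue: (C, V) => C): Iterator[(K, C)] = {\n  val combiners = mutable.HashMap.empty[K, C]\n  iter.foreach { case (k, v) =>\n    combiners(k) = combiners.get(k) match {\n      case Some(c) => mergeValue(c, v)   // key seen before: merge the value in\n      case None    => createCombiner(v)  // first value for this key\n    }\n  }\n  combiners.iterator\n}\n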

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              combineValuesByKey is used when:

• PairRDDFunctions.combineByKeyWithClassTag transformation is used (with the same Partitioner as the RDD's)

• BlockStoreShuffleReader is requested to read combined records for a reduce task (with the Map-Size Partial Aggregation Flag off)

combineCombinersByKey Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/Aggregator/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              combineCombinersByKey( iter: Iterator[_ <: Product2[K, C]], context: TaskContext): Iterator[(K, C)]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              combineCombinersByKey...FIXME

combineCombinersByKey is used when BlockStoreShuffleReader is requested to read combined records for a reduce task (with the Map-Size Partial Aggregation Flag on).

Updating Task Metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/Aggregator/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              updateMetrics( context: TaskContext, map: ExternalAppendOnlyMap[_, _, _]): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              updateMetrics requests the input TaskContext for the TaskMetrics to update the metrics based on the metrics of the input ExternalAppendOnlyMap:

• Increment memory bytes spilled

• Increment disk bytes spilled

• Increment peak execution memory

updateMetrics is used when Aggregator is requested to combineValuesByKey and combineCombinersByKey."},{"location":"rdd/AsyncRDDActions/","title":"AsyncRDDActions","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              AsyncRDDActions is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/CheckpointRDD/","title":"CheckpointRDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CheckpointRDD is an extension of the RDD abstraction for RDDs that recovers checkpointed data from storage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CheckpointRDD cannot be checkpointed again (and doCheckpoint, checkpoint, and localCheckpoint are simply noops).

getPartitions and compute throw a NotImplementedError and are supposed to be overridden by the implementations.
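
For example (the checkpoint directory below is just an example path), a ReliableCheckpointRDD comes into play after an RDD has been checkpointed to reliable storage:

sc.setCheckpointDir(\"/tmp/checkpoints\")  // example path (should be reliable storage, e.g. HDFS)\nval nums = sc.parallelize(1 to 100)\nnums.checkpoint()  // marks the RDD for (reliable) checkpointing\nnums.count()       // the first job also writes the checkpoint files\n// later computations of nums read the data back via a ReliableCheckpointRDD\n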

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/CheckpointRDD/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • LocalCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ReliableCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/CoGroupedRDD/","title":"CoGroupedRDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CoGroupedRDD[K] is an RDD that cogroups the parent RDDs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDD[(K, Array[Iterable[_]])]\n

For each key k in the parent RDDs, the resulting RDD contains a tuple with the list of values for that key (one Iterable per parent RDD).
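
For illustration, RDD.cogroup (which creates a CoGroupedRDD under the covers) groups the values of the same key from both RDDs:

val left  = sc.parallelize(Seq((1, \"a\"), (2, \"b\")))\nval right = sc.parallelize(Seq((1, \"x\"), (3, \"y\")))\nleft.cogroup(right).collect()\n// e.g. Array((1,(CompactBuffer(a),CompactBuffer(x))), (2,(CompactBuffer(b),CompactBuffer())), (3,(CompactBuffer(),CompactBuffer(y))))\n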

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/CoGroupedRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CoGroupedRDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Key-Value RDDs (Seq[RDD[_ <: Product2[K, _]]])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Partitioner

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CoGroupedRDD is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RDD.cogroup operator is used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/CoalescedRDD/","title":"CoalescedRDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CoalescedRDD is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/Dependency/","title":"Dependency","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Dependency[T] is an abstraction of dependencies between RDDs.

Any time an RDD transformation (e.g. map, flatMap) is used (and the RDD lineage graph is built), Dependencies are the edges.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#rdd","title":"RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                rdd: RDD[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested for the shuffle dependencies and ResourceProfiles (of an RDD)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RDD is requested to getNarrowAncestors, cleanShuffleDependencies, firstParent, parent, toDebugString, getOutputDeterministicLevel
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":"","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • NarrowDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"rdd/Dependency/#demo","title":"Demo","text":"

The dependencies of an RDD are available using the RDD.dependencies method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                scala> myRdd.dependencies.foreach(println)\norg.apache.spark.ShuffleDependency@41e38d89\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                scala> myRdd.dependencies.map(_.rdd).foreach(println)\nMapPartitionsRDD[6] at groupBy at <console>:39\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                scala> println(myRdd.toDebugString)\n(16) ShuffledRDD[7] at groupBy at <console>:39 []\n +-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []\n    |   ParallelCollectionRDD[5] at parallelize at <console>:39 []\n
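For comparison, a narrow transformation such as map yields a NarrowDependency (a spark-shell sketch, output abbreviated):

scala> sc.parallelize(0 to 9).map(_ + 1).dependencies.foreach(println)\norg.apache.spark.OneToOneDependency@...\n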
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"rdd/HadoopRDD/","title":"HadoopRDD","text":"

HadoopRDD (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.HadoopRDD) is an RDD that provides core functionality for reading data stored in HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI using the older MapReduce API (org.apache.hadoop.mapred, https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/package-summary.html).

HadoopRDD is created as a result of calling the following SparkContext methods (see the lineage sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • hadoopFile
• textFile (the one most often used in examples)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • sequenceFile
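A quick way to spot the HadoopRDD behind SparkContext.textFile is the RDD lineage (a spark-shell sketch assuming a local README.md; RDD IDs and partition counts will vary):

scala> println(sc.textFile(\"README.md\").toDebugString)\n(2) README.md MapPartitionsRDD[1] at textFile at <console>:25 []\n |  README.md HadoopRDD[0] at textFile at <console>:25 []\n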

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Partitions are of type HadoopPartition.

When a HadoopRDD is computed, i.e. an action is called, you should see the INFO message Input split: in the logs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                scala> sc.textFile(\"README.md\").count\n...\n15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784\n15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:1784+1784\n...\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The following properties are set upon partition execution:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mapred.tip.id - task id of this task's attempt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mapred.task.id - task attempt's id
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mapred.task.is.map as true
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mapred.task.partition - split id
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mapred.job.id

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Spark settings for HadoopRDD:

• spark.hadoop.cloneConf (default: false) - shouldCloneJobConf - controls whether the Hadoop job configuration (a JobConf object) should be cloned before spawning a Hadoop job. Refer to SPARK-2546 (Configuration object thread safety issue, https://issues.apache.org/jira/browse/SPARK-2546). When true, you should see the DEBUG message Cloning Hadoop Configuration (see the sketch below).
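A minimal sketch of enabling the setting programmatically (the master, app name and values below are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}\nval conf = new SparkConf()\n  .setMaster(\"local[*]\")\n  .setAppName(\"clone-conf-demo\")\n  .set(\"spark.hadoop.cloneConf\", \"true\")  // clone the JobConf per task (SPARK-2546)\nval sc = SparkContext.getOrCreate(conf)\n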

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                You can register callbacks on TaskContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                HadoopRDDs are not checkpointed. They do nothing when checkpoint() is called.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/HadoopRDD/#caution","title":"[CAUTION]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • What are InputMetrics?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • What is JobConf?
• What are the InputSplits: FileSplit and CombineFileSplit?
• What are InputFormat and Configurable subtypes?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • What's InputFormat's RecordReader? It creates a key and a value. What are they?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[getPreferredLocations]] getPreferredLocations Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[getPartitions]] getPartitions Method

The number of partitions of a HadoopRDD, i.e. the return value of getPartitions, is calculated using InputFormat.getSplits(jobConf, minPartitions), where minPartitions is only a hint of how many partitions one may want at minimum. Being just a hint, it does not guarantee that the number of partitions will be exactly the number given.
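A spark-shell sketch of the hint in action (assuming a local README.md; the actual count depends on the file size and the InputFormat):

val lines = sc.textFile(\"README.md\", minPartitions = 4)\nprintln(lines.getNumPartitions)  // usually at least 4, but InputFormat.getSplits has the final say\n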

For SparkContext.textFile the input format class is org.apache.hadoop.mapred.TextInputFormat (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html).

The javadoc of org.apache.hadoop.mapred.FileInputFormat (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html) says:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.

TIP: You may find the sources of org.apache.hadoop.mapred.FileInputFormat.getSplits (https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L319) enlightening.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/HadoopRDD/#whats-hadoop-split-input-splits-for-hadoop-reads-see-inputformatgetsplits","title":"What's Hadoop Split? input splits for Hadoop reads? See InputFormat.getSplits","text":""},{"location":"rdd/HashPartitioner/","title":"HashPartitioner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                HashPartitioner is a Partitioner for hash-based partitioning.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Important

HashPartitioner places null keys in the 0th partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                HashPartitioner is used as the default Partitioner.
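For example, key-based transformations such as groupBy fall back on a HashPartitioner when no Partitioner is given explicitly (a spark-shell sketch, output abbreviated):

scala> sc.parallelize(0 to 9).groupBy(_ % 2).partitioner\nres0: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@...)\n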

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/HashPartitioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                HashPartitioner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Number of partitions"},{"location":"rdd/HashPartitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitions: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitions returns the given number of partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  numPartitions\u00a0is part of the Partitioner abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"rdd/HashPartitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPartition(\n  key: Any): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  For null keys getPartition simply returns 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  For non-null keys, getPartition uses the Object.hashCode of the key modulo the number of partitions. For negative results, getPartition adds the number of partitions to make it non-negative.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPartition\u00a0is part of the Partitioner abstraction.
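A minimal sketch of the partitioning rule above (the key values are illustrative only):

import org.apache.spark.HashPartitioner\nval p = new HashPartitioner(4)\np.getPartition(null)     // 0\np.getPartition(\"spark\")  // \"spark\".hashCode modulo 4, shifted into [0, 4) if negative\n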

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"rdd/LocalCheckpointRDD/","title":"LocalCheckpointRDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalCheckpointRDD[T] is a CheckpointRDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/LocalCheckpointRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalCheckpointRDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDD ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Number of Partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    LocalCheckpointRDD is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LocalRDDCheckpointData is requested to doCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rdd/LocalCheckpointRDD/#partitions","title":"Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getPartitions: Array[Partition]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getPartitions\u00a0is part of the RDD abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getPartitions creates a CheckpointRDDPartition for every input partition (index).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"rdd/LocalCheckpointRDD/#computing-partition","title":"Computing Partition
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    compute(\n  partition: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    compute\u00a0is part of the RDD abstraction.

compute merely throws a SparkException (that explains the reason):

Checkpoint block [RDDBlockId] not found! Either the executor\nthat originally checkpointed this partition is no longer alive, or the original RDD is\nunpersisted. If this problem persists, you may consider using `rdd.checkpoint()`\ninstead, which is slower than local checkpointing but more fault-tolerant.\n
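
The following spark-shell sketch is illustrative only (it is not part of the original text; the /tmp/checkpoints directory is an arbitrary example path). It contrasts local checkpointing with the reliable checkpointing that the error message recommends:

val fast = sc.parallelize(0 to 9)\nfast.localCheckpoint()  // fast, but blocks live only in executor storage\nfast.count()\n\nval reliable = sc.parallelize(0 to 9)\nsc.setCheckpointDir(\"/tmp/checkpoints\")  // any reliable (e.g. HDFS) directory\nreliable.checkpoint()  // slower, but survives executor loss\nreliable.count()\n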
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"rdd/LocalRDDCheckpointData/","title":"LocalRDDCheckpointData","text":"

LocalRDDCheckpointData is an RDDCheckpointData.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rdd/LocalRDDCheckpointData/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    LocalRDDCheckpointData takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      LocalRDDCheckpointData is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RDD is requested to localCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"rdd/LocalRDDCheckpointData/#docheckpoint","title":"doCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      doCheckpoint(): CheckpointRDD[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      doCheckpoint\u00a0is part of the RDDCheckpointData abstraction.

doCheckpoint creates a LocalCheckpointRDD with the RDD. doCheckpoint triggers caching of any missing partitions (by checking the availability of the RDDBlockIds for the partitions in the BlockManagerMaster).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Extra Spark Job

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If there are any missing partitions (RDDBlockIds) doCheckpoint requests the SparkContext to run a Spark job with the RDD and the missing partitions.

doCheckpoint makes sure that the StorageLevel of the RDD uses disk (possibly among other storage media). If not, doCheckpoint throws an AssertionError:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Storage level [level] is not appropriate for local checkpointing\n
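
As an illustrative sketch (not from the original text), persisting with a disk-backed storage level before marking an RDD for local checkpointing keeps the assertion satisfied once an action triggers doCheckpoint:

import org.apache.spark.storage.StorageLevel\n\nval rdd = sc.parallelize(0 to 9).persist(StorageLevel.MEMORY_AND_DISK)  // level uses disk\nrdd.localCheckpoint()\nrdd.count()  // first action materializes the blocks and runs doCheckpoint\n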
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"rdd/MapPartitionsRDD/","title":"MapPartitionsRDD","text":"

MapPartitionsRDD[U, T] is an RDD that transforms (maps) input records of type T into records of type U using a partition function.

MapPartitionsRDD is an RDD with exactly one narrow (one-to-one) dependency on the parent RDD.
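
For illustration (a spark-shell sketch, not from the original text; the printed object hash and res numbering will differ), a plain map already produces a MapPartitionsRDD with a single OneToOneDependency:

val myRdd = sc.parallelize(0 to 9).map(_ + 1)\n\nscala> myRdd.getClass.getSimpleName\nres0: String = MapPartitionsRDD\n\nscala> myRdd.dependencies.foreach(println)\norg.apache.spark.OneToOneDependency@1b8a3d2f\n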

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"rdd/MapPartitionsRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapPartitionsRDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Parent RDD (RDD[T])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Partition Function
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • preservesPartitioning flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • isFromBarrier Flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • isOrderSensitive flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapPartitionsRDD is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • PairRDDFunctions is requested to mapValues and flatMapValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDD is requested to map, flatMap, filter, glom, mapPartitions, mapPartitionsWithIndexInternal, mapPartitionsInternal, mapPartitionsWithIndex
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDDBarrier is requested to mapPartitions, mapPartitionsWithIndex
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/MapPartitionsRDD/#barrier-rdd","title":"Barrier RDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapPartitionsRDD can be a barrier RDD in Barrier Execution Mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/MapPartitionsRDD/#isFromBarrier","title":"isFromBarrier Flag","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapPartitionsRDD can be given isFromBarrier flag when created.

isFromBarrier flag is disabled (false) by default and can only be enabled (true) using the following RDDBarrier transformations (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDDBarrier.mapPartitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDDBarrier.mapPartitionsWithIndex
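
A minimal sketch (not from the original text) of enabling isFromBarrier through RDDBarrier.mapPartitions:

// barrier() gives an RDDBarrier; its mapPartitions yields a MapPartitionsRDD\n// with isFromBarrier enabled\nval barrierRdd = sc.parallelize(0 to 9, numSlices = 2).barrier().mapPartitions(iter => iter)\n\nbarrierRdd.count()  // needs at least 2 free task slots (barrier scheduling)\n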
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/MapPartitionsRDD/#isBarrier_","title":"isBarrier_","text":"RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        isBarrier_ : Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        isBarrier_ is part of the RDD abstraction.

isBarrier_ is enabled (true) when either this MapPartitionsRDD has the isFromBarrier flag enabled or any of the parent RDDs is a barrier RDD (isBarrier). Otherwise, isBarrier_ is disabled (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/NarrowDependency/","title":"NarrowDependency","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NarrowDependency[T] is an extension of the Dependency abstraction for narrow dependencies (of RDD[T]s) where each partition of the child RDD depends on a small number of partitions of the parent RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#contract","title":"Contract","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#getparents","title":"getParents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getParents(\n  partitionId: Int): Seq[Int]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The parent partitions for a given child partition

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is requested for the preferred locations (of a partition of an RDD)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#implementations","title":"Implementations","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#onetoonedependency","title":"OneToOneDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        OneToOneDependency is a NarrowDependency with getParents returning a single-element collection with the given partitionId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        val myRdd = sc.parallelize(0 to 9).map((_, 1))\n\nscala> :type myRdd\norg.apache.spark.rdd.RDD[(Int, Int)]\n\nscala> myRdd.dependencies.foreach(println)\norg.apache.spark.OneToOneDependency@801fe56\n\nimport org.apache.spark.OneToOneDependency\nval dep = myRdd.dependencies.head.asInstanceOf[OneToOneDependency[(_, _)]]\n\nscala> println(dep.getParents(0))\nList(0)\n\nscala> println(dep.getParents(1))\nList(1)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#prunedependency","title":"PruneDependency

PruneDependency is a NarrowDependency that represents a dependency between a PartitionPruningRDD and its parent RDD, restricted to a subset of the parent's partitions.
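
For illustration (a sketch, not from the original text), PartitionPruningRDD.create builds the child RDD whose PruneDependency keeps only the parent partitions selected by the predicate:

import org.apache.spark.rdd.PartitionPruningRDD\n\nval parent = sc.parallelize(0 to 9, numSlices = 4)\n\n// keep only parent partitions 0 and 2\nval pruned = PartitionPruningRDD.create(parent, partitionId => partitionId % 2 == 0)\n\nscala> pruned.partitions.length\nres0: Int = 2\n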

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#rangedependency","title":"RangeDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RangeDependency is a NarrowDependency that represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used in UnionRDD (SparkContext.union).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        val r1 = sc.range(0, 4)\nval r2 = sc.range(5, 9)\n\nval unioned = sc.union(r1, r2)\n\nscala> unioned.dependencies.foreach(println)\norg.apache.spark.RangeDependency@76b0e1d9\norg.apache.spark.RangeDependency@3f3e51e0\n\nimport org.apache.spark.RangeDependency\nval dep = unioned.dependencies.head.asInstanceOf[RangeDependency[(_, _)]]\n\nscala> println(dep.getParents(0))\nList(0)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"rdd/NarrowDependency/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NarrowDependency takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • RDD[T]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Abstract Class

NarrowDependency is an abstract class and cannot be created directly. It is created indirectly through one of the concrete NarrowDependencies.
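
A concrete subclass only has to implement getParents. The following IdentityDependency is a hypothetical sketch (it mirrors OneToOneDependency and is not part of Spark):

import org.apache.spark.NarrowDependency\nimport org.apache.spark.rdd.RDD\n\n// hypothetical: every child partition depends on the parent partition with the same index\nclass IdentityDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {\n  override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId)\n}\n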

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","tags":["DeveloperApi"]},{"location":"rdd/NewHadoopRDD/","title":"NewHadoopRDD","text":"

NewHadoopRDD is an RDD of K keys and V values.

NewHadoopRDD is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext.newAPIHadoopFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext.newAPIHadoopRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • (indirectly) SparkContext.binaryFiles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • (indirectly) SparkContext.wholeTextFiles

Note

NewHadoopRDD is the base RDD of BinaryFileRDD and WholeTextFileRDD.
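
A spark-shell sketch (illustrative only; the input path is a made-up example) of creating a NewHadoopRDD through SparkContext.newAPIHadoopFile with the new Hadoop API TextInputFormat:

import org.apache.hadoop.io.{LongWritable, Text}\nimport org.apache.hadoop.mapreduce.lib.input.TextInputFormat\n\nval lines = sc.newAPIHadoopFile(\n  \"hdfs://namenode:8020/tmp/input.txt\",  // example path\n  classOf[TextInputFormat],\n  classOf[LongWritable],\n  classOf[Text])\n\n// lines is a NewHadoopRDD-backed RDD[(LongWritable, Text)]\n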

getPreferredLocations Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CAUTION: FIXME

Creating NewHadoopRDD Instance

NewHadoopRDD takes the following to be created:

• SparkContext
• InputFormat[K, V] (Hadoop)
• K class name
• V class name
• transient Configuration (Hadoop)

NewHadoopRDD initializes the internal registries and counters."},{"location":"rdd/OrderedRDDFunctions/","title":"OrderedRDDFunctions","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          class OrderedRDDFunctions[\n  K: Ordering : ClassTag,\n  V: ClassTag,\n  P <: Product2[K, V] : ClassTag]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          OrderedRDDFunctions adds extra operators to RDDs of (key, value) pairs (RDD[(K, V)]) where the K key is sortable (i.e. any key type K that has an implicit Ordering[K] in scope).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Learn more about Ordering in the Scala Standard Library documentation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rdd/OrderedRDDFunctions/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          OrderedRDDFunctions takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • RDD of Ps

OrderedRDDFunctions is created using the RDD.rddToOrderedRDDFunctions implicit method.
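
For example (a minimal sketch, assuming an active SparkContext sc as in spark-shell), sortByKey becomes available on an RDD[(Int, String)] thanks to that implicit conversion:

// The RDD.rddToOrderedRDDFunctions implicit conversion kicks in for a sortable key (Int here)\nval pairs = sc.parallelize(Seq((3, \"c\"), (1, \"a\"), (2, \"b\")))\npairs.sortByKey().collect // Array((1,a), (2,b), (3,c))\n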

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/OrderedRDDFunctions/#filterbyrange","title":"filterByRange
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            filterByRange(\n  lower: K,\n  upper: K): RDD[P]\n

filterByRange returns an RDD with only the pairs whose keys are within the inclusive [lower, upper] range. When the RDD is partitioned by a RangePartitioner, filterByRange prunes the partitions that cannot contain matching keys; otherwise, it simply filters every partition.
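
A quick sketch (assuming an active SparkContext sc):

val pairs = sc.parallelize(Seq((1, \"a\"), (5, \"e\"), (9, \"i\"))).sortByKey() // range-partitioned\n// Keep only the pairs with keys in the inclusive range [2, 8]\npairs.filterByRange(2, 8).collect // Array((5,e))\n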

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/OrderedRDDFunctions/#repartitionandsortwithinpartitions","title":"repartitionAndSortWithinPartitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            repartitionAndSortWithinPartitions(\n  partitioner: Partitioner): RDD[(K, V)]\n

repartitionAndSortWithinPartitions creates a ShuffledRDD with the given Partitioner and the implicit key ordering, so records are repartitioned and sorted by key within every output partition in a single shuffle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

repartitionAndSortWithinPartitions is a generalization of the sortByKey operator.
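
A minimal sketch (assuming an active SparkContext sc) that repartitions by key hash and sorts records by key within every output partition in one shuffle:

import org.apache.spark.HashPartitioner\nval pairs = sc.parallelize(Seq((3, \"c\"), (1, \"a\"), (4, \"d\"), (2, \"b\")))\nval repartitioned = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))\n// Every partition holds its keys in sorted order\nrepartitioned.glom().collect\n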

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/OrderedRDDFunctions/#sortbykey","title":"sortByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            sortByKey(\n  ascending: Boolean = true,\n  numPartitions: Int = self.partitions.length): RDD[(K, V)]\n

sortByKey creates a ShuffledRDD with a RangePartitioner (over the requested number of partitions) and the implicit key ordering, so the result is range-partitioned and sorted by key within every partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

sortByKey is a specialization of the repartitionAndSortWithinPartitions operator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            sortByKey is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD.sortBy high-level operator is used
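
For example, RDD.sortBy keys the records by the given function and delegates to sortByKey (a sketch, assuming an active SparkContext sc):

// sortBy = keyBy(f) + sortByKey + values\nsc.parallelize(Seq(3, 1, 2)).sortBy(x => x, ascending = false).collect // Array(3, 2, 1)\n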
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/","title":"PairRDDFunctions","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PairRDDFunctions is an extension of RDD API for additional high-level operators to work with key-value RDDs (RDD[(K, V)]).

PairRDDFunctions is available on RDDs of key-value pairs via a Scala implicit conversion (RDD.rddToPairRDDFunctions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The gist of PairRDDFunctions is combineByKeyWithClassTag.
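
For example (a sketch, assuming an active SparkContext sc), reduceByKey below is not defined on RDD itself but comes from PairRDDFunctions through the implicit conversion:

val counts = sc.parallelize(Seq((\"a\", 1), (\"a\", 2), (\"b\", 3)))\n// reduceByKey is provided by PairRDDFunctions (via RDD.rddToPairRDDFunctions)\ncounts.reduceByKey(_ + _).collect // e.g. Array((a,3), (b,3))\n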

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/PairRDDFunctions/#aggregatebykey","title":"aggregateByKey
aggregateByKey[U: ClassTag](\n  zeroValue: U)(\n  seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)] // (1)!\naggregateByKey[U: ClassTag](\n  zeroValue: U,\n  numPartitions: Int)(\n  seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)] // (2)!\naggregateByKey[U: ClassTag](\n  zeroValue: U,\n  partitioner: Partitioner)(\n  seqOp: (U, V) => U,\n  combOp: (U, U) => U): RDD[(K, U)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            aggregateByKey...FIXME
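
A sketch of aggregateByKey (assuming an active SparkContext sc) that computes a per-key (sum, count) pair, with seqOp folding values into the accumulator and combOp merging accumulators across partitions:

val scores = sc.parallelize(Seq((\"a\", 1), (\"a\", 3), (\"b\", 5)))\nval sumCount = scores.aggregateByKey((0, 0))(\n  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: within a partition\n  (a, b) => (a._1 + b._1, a._2 + b._2))  // combOp: across partitions\nsumCount.collect // e.g. Array((a,(4,2)), (b,(5,1)))\n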

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#combinebykey","title":"combineByKey
combineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C): RDD[(K, C)] // (1)!\ncombineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  numPartitions: Int): RDD[(K, C)] // (2)!\ncombineByKey[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  partitioner: Partitioner,\n  mapSideCombine: Boolean = true,\n  serializer: Serializer = null): RDD[(K, C)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            combineByKey...FIXME
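
A sketch of combineByKey (assuming an active SparkContext sc) that collects the values of every key into a list:

val pairs = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\nval perKey = pairs.combineByKey(\n  (v: Int) => List(v),                          // createCombiner\n  (c: List[Int], v: Int) => v :: c,             // mergeValue\n  (c1: List[Int], c2: List[Int]) => c1 ::: c2)  // mergeCombiners\nperKey.collect // e.g. Array((a,List(3, 1)), (b,List(2)))\n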

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#combinebykeywithclasstag","title":"combineByKeyWithClassTag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            combineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C)(implicit ct: ClassTag[C]): RDD[(K, C)] // (1)!\ncombineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  numPartitions: Int)(implicit ct: ClassTag[C]): RDD[(K, C)] // (2)!\ncombineByKeyWithClassTag[C](\n  createCombiner: V => C,\n  mergeValue: (C, V) => C,\n  mergeCombiners: (C, C) => C,\n  partitioner: Partitioner,\n  mapSideCombine: Boolean = true,\n  serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Uses a HashPartitioner (with the given numPartitions)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            combineByKeyWithClassTag creates an Aggregator for the given aggregation functions.

combineByKeyWithClassTag branches off based on whether the given Partitioner is the same as the RDD's current partitioner.

If the partitioners are the same, combineByKeyWithClassTag avoids a shuffle and simply calls mapPartitions on the RDD with the following:

• A function that combines the values of every key within a partition using the Aggregator (Aggregator.combineValuesByKey)

• preservesPartitioning flag turned on

If the input partitioner is different from the RDD's, combineByKeyWithClassTag creates a ShuffledRDD (with the Serializer, the Aggregator, and the mapSideCombine flag).
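
A sketch of the no-shuffle branch (assuming an active SparkContext sc): when the RDD has already been partitioned by the very same Partitioner, the lineage shows only a MapPartitionsRDD on top of the earlier shuffle rather than a new ShuffledRDD:

import org.apache.spark.HashPartitioner\nval p = new HashPartitioner(4)\nval prePartitioned = sc.parallelize(Seq((1, 1), (2, 1), (1, 2))).partitionBy(p)\nval combined = prePartitioned.combineByKeyWithClassTag(\n  (v: Int) => v,\n  (c: Int, v: Int) => c + v,\n  (c1: Int, c2: Int) => c1 + c2,\n  p) // same partitioner as the RDD's => mapPartitions, no extra shuffle\nprintln(combined.toDebugString)\n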

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#usage","title":"Usage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            combineByKeyWithClassTag lays the foundation for the following high-level RDD key-value pair transformations:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • aggregateByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • combineByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • countApproxDistinctByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • foldByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • groupByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • reduceByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#requirements","title":"Requirements

combineByKeyWithClassTag requires that mergeCombiners is defined (non-null) or throws an IllegalArgumentException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            mergeCombiners must be defined\n

combineByKeyWithClassTag throws a SparkException when the keys are arrays and the mapSideCombine flag is enabled:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Cannot use map-side combining with array keys.\n

combineByKeyWithClassTag throws a SparkException when the keys are arrays and the partitioner is a HashPartitioner:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            HashPartitioner cannot partition array keys.\n
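
Both guards can be observed with a (contrived) RDD keyed by arrays; a sketch assuming an active SparkContext sc:

val arrayKeyed = sc.parallelize(Seq((Array(1), \"a\"), (Array(1), \"b\")))\n// reduceByKey keeps mapSideCombine enabled, so the first guard fires:\n//   SparkException: Cannot use map-side combining with array keys.\narrayKeyed.reduceByKey(_ + _)\n// groupByKey disables map-side combine, so the HashPartitioner guard fires instead:\n//   SparkException: HashPartitioner cannot partition array keys.\narrayKeyed.groupByKey()\n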
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#example","title":"Example
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            val nums = sc.parallelize(0 to 9, numSlices = 4)\nval groups = nums.keyBy(_ % 2)\ndef createCombiner(n: Int) = {\n  println(s\"createCombiner($n)\")\n  n\n}\ndef mergeValue(n1: Int, n2: Int) = {\n  println(s\"mergeValue($n1, $n2)\")\n  n1 + n2\n}\ndef mergeCombiners(c1: Int, c2: Int) = {\n  println(s\"mergeCombiners($c1, $c2)\")\n  c1 + c2\n}\nval countByGroup = groups.combineByKeyWithClassTag(\n  createCombiner,\n  mergeValue,\n  mergeCombiners)\nprintln(countByGroup.toDebugString)\n/*\n(4) ShuffledRDD[3] at combineByKeyWithClassTag at <console>:31 []\n +-(4) MapPartitionsRDD[1] at keyBy at <console>:25 []\n    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n*/\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#countapproxdistinctbykey","title":"countApproxDistinctByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            countApproxDistinctByKey(\n  relativeSD: Double = 0.05): RDD[(K, Long)] // (1)!\ncountApproxDistinctByKey(\n  relativeSD: Double,\n  numPartitions: Int): RDD[(K, Long)] // (2)!\ncountApproxDistinctByKey(\n  relativeSD: Double,\n  partitioner: Partitioner): RDD[(K, Long)]\ncountApproxDistinctByKey(\n  p: Int,\n  sp: Int,\n  partitioner: Partitioner): RDD[(K, Long)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            countApproxDistinctByKey...FIXME
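
A sketch (assuming an active SparkContext sc) of the approximate per-key distinct count (HyperLogLog-based, so the result is an estimate within the requested relative accuracy):

val events = sc.parallelize(Seq((\"user1\", \"click\"), (\"user1\", \"click\"), (\"user1\", \"view\"), (\"user2\", \"view\")))\nevents.countApproxDistinctByKey(0.05).collect // relativeSD = 0.05 (the default)\n// e.g. Array((user1,2), (user2,1))\n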

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#foldbykey","title":"foldByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            foldByKey(\n  zeroValue: V)(\n  func: (V, V) => V): RDD[(K, V)] // (1)!\nfoldByKey(\n  zeroValue: V,\n  numPartitions: Int)(\n  func: (V, V) => V): RDD[(K, V)] // (2)!\nfoldByKey(\n  zeroValue: V,\n  partitioner: Partitioner)(\n  func: (V, V) => V): RDD[(K, V)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            foldByKey...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            foldByKey is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD.treeAggregate high-level operator is used
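
A quick sketch of foldByKey itself (assuming an active SparkContext sc), folding the values of every key starting from a zero value:

val pairs = sc.parallelize(Seq((\"a\", 1), (\"a\", 2), (\"b\", 3)))\npairs.foldByKey(0)(_ + _).collect // e.g. Array((a,3), (b,3))\n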
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#groupbykey","title":"groupByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            groupByKey(): RDD[(K, Iterable[V])] // (1)!\ngroupByKey(\n  numPartitions: Int): RDD[(K, Iterable[V])] // (2)!\ngroupByKey(\n  partitioner: Partitioner): RDD[(K, Iterable[V])]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

groupByKey uses combineByKeyWithClassTag with a CompactBuffer to collect the values of every key and, notably, with the mapSideCombine flag disabled (grouping all values up front would not reduce the amount of data shuffled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            groupByKey is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD.groupBy high-level operator is used
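
A quick sketch of groupByKey (assuming an active SparkContext sc):

val pairs = sc.parallelize(Seq((\"a\", 1), (\"b\", 2), (\"a\", 3)))\npairs.groupByKey().mapValues(_.toList).collect\n// e.g. Array((a,List(1, 3)), (b,List(2)))\n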
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#partitionby","title":"partitionBy
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            partitionBy(\n  partitioner: Partitioner): RDD[(K, V)]\n

partitionBy returns the RDD itself when it is already partitioned by the given Partitioner; otherwise, it creates a ShuffledRDD with the given Partitioner.
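
A sketch of partitionBy (assuming an active SparkContext sc):

import org.apache.spark.HashPartitioner\nval pairs = sc.parallelize(Seq((1, \"a\"), (2, \"b\"), (1, \"c\")))\nval byKey = pairs.partitionBy(new HashPartitioner(2))\nbyKey.getNumPartitions // 2\nbyKey.partitioner.isDefined // true\n// Partitioning again with an equal Partitioner would return the RDD as-is (no extra shuffle)\n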

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#reducebykey","title":"reduceByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reduceByKey(\n  func: (V, V) => V): RDD[(K, V)] // (1)!\nreduceByKey(\n  func: (V, V) => V,\n  numPartitions: Int): RDD[(K, V)] // (2)!\nreduceByKey(\n  partitioner: Partitioner,\n  func: (V, V) => V): RDD[(K, V)]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Uses the default Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Creates a HashPartitioner with the given numPartitions partitions

reduceByKey is a special case of aggregateByKey in which the same function merges values both within and across partitions (and no separate zero value is needed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            reduceByKey is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD.distinct high-level operator is used
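A short usage sketch (assuming an existing SparkContext sc and made-up data) that sums the values per key:

// Hypothetical (word, count) pairs
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))

// Sum the values per key using the default Partitioner
val summed = pairs.reduceByKey(_ + _)

// Or with an explicit number of partitions (a HashPartitioner is created)
val summedIn4 = pairs.reduceByKey(_ + _, numPartitions = 4)

summed.collect()  // e.g. Array((a,3), (b,1))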
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#saveasnewapihadoopfile","title":"saveAsNewAPIHadoopFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saveAsNewAPIHadoopFile(\n  path: String,\n  keyClass: Class[_],\n  valueClass: Class[_],\n  outputFormatClass: Class[_ <: NewOutputFormat[_, _]],\n  conf: Configuration = self.context.hadoopConfiguration): Unit\nsaveAsNewAPIHadoopFile[F <: NewOutputFormat[K, V]](\n  path: String)(implicit fm: ClassTag[F]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saveAsNewAPIHadoopFile creates a new Job (Hadoop MapReduce) for the given Configuration (Hadoop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saveAsNewAPIHadoopFile configures the Job (with the given keyClass, valueClass and outputFormatClass).

saveAsNewAPIHadoopFile sets the mapreduce.output.fileoutputformat.outputdir configuration property to the given path and then saves the RDD out using saveAsNewAPIHadoopDataset.
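A hedged usage sketch (the output path and data are made up) that writes a pair RDD of Hadoop Writables with the new-API TextOutputFormat:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Hypothetical pair RDD converted to Hadoop Writables
val out = sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

// Write with the new Hadoop MapReduce OutputFormat API (the path is hypothetical)
out.saveAsNewAPIHadoopFile[TextOutputFormat[Text, IntWritable]]("hdfs://namenode/tmp/counts")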

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/PairRDDFunctions/#saveasnewapihadoopdataset","title":"saveAsNewAPIHadoopDataset
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saveAsNewAPIHadoopDataset(\n  conf: Configuration): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            saveAsNewAPIHadoopDataset creates a new HadoopMapReduceWriteConfigUtil (with the given Configuration) and writes the RDD out.

The Configuration should have all the relevant output parameters set (an output format, output paths, e.g. a table name to write to), in the same way as it would be configured for a Hadoop MapReduce job.
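A hedged sketch (with a made-up output path and data) of configuring the output through a Hadoop Job and then writing the RDD out:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

// Configure the output the same way as for a Hadoop MapReduce job
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[Text])
job.setOutputValueClass(classOf[IntWritable])
job.setOutputFormatClass(classOf[TextOutputFormat[Text, IntWritable]])
FileOutputFormat.setOutputPath(job, new Path("hdfs://namenode/tmp/counts"))  // hypothetical path

sc.parallelize(Seq(("a", 1), ("b", 2)))
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }
  .saveAsNewAPIHadoopDataset(job.getConfiguration)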

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/ParallelCollectionRDD/","title":"ParallelCollectionRDD","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ParallelCollectionRDD is an RDD of a collection of elements with numSlices partitions and optional locationPrefs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ParallelCollectionRDD is the result of SparkContext.parallelize and SparkContext.makeRDD methods.

The data collection is split into numSlices slices.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            It uses ParallelCollectionPartition.
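A usage sketch (assuming an existing SparkContext sc):

// A ParallelCollectionRDD with 4 partitions
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.getNumPartitions  // 4

// makeRDD also accepts preferred locations per element (locationPrefs)
val withPrefs = sc.makeRDD(Seq(
  (1, Seq("host1")),
  (2, Seq("host2"))
))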

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/Partition/","title":"Partition","text":"

Partition is a part (a slice) of the data of an RDD.

NOTE: A partition is missing when it has not been computed yet.

Partition is identified by a partition index that is a unique identifier of a partition of an RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/Partition/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/Partition/#index-int","title":"index: Int","text":""},{"location":"rdd/Partitioner/","title":"Partitioner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partitioner is an abstraction of partitioners that define how the elements in a key-value pair RDD are partitioned by key.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partitioner maps keys to partition IDs (from 0 to numPartitions exclusive).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partitioner ensures that records with the same key are in the same partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partitioner is a Java Serializable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/Partitioner/#contract","title":"Contract","text":""},{"location":"rdd/Partitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPartition(\n  key: Any): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partition ID for the given key

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/Partitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            numPartitions: Int\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rdd/Partitioner/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • HashPartitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RangePartitioner
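A minimal sketch of a custom Partitioner (a made-up FirstCharPartitioner, for illustration only) that implements the two contract methods above:

import org.apache.spark.Partitioner

// Hypothetical partitioner that routes String keys by their first character
class FirstCharPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case null      => 0
    case s: String => math.abs(s.headOption.map(_.toInt).getOrElse(0)) % numPartitions
    case other     => math.abs(other.hashCode) % numPartitions
  }
}

// Usage with PairRDDFunctions.partitionBy
// pairs.partitionBy(new FirstCharPartitioner(8))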
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/RDD/","title":"RDD \u2014 Description of Distributed Computation","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RDD[T] is an abstraction of fault-tolerant resilient distributed datasets that are mere descriptions of computations over a distributed collection of records (of type T).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/RDD/#contract","title":"Contract","text":""},{"location":"rdd/RDD/#compute","title":"Computing Partition","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            compute(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Computes the input Partition (with the TaskContext) to produce values (of type T)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            See:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapPartitionsRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ReliableCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffledRDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD is requested to computeOrReadCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/RDD/#getPartitions","title":"Partitions","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPartitions: Array[Partition]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Partitions of this RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            See:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapPartitionsRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ReliableCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffledRDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD is requested for the partitions
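To make the compute and getPartitions contract methods concrete, here is a minimal sketch of a custom RDD; NumbersRDD and RangePartition are made-up names for illustration:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type carrying a half-open range of numbers
class RangePartition(override val index: Int, val start: Int, val end: Int) extends Partition

// A toy RDD (made up for illustration) that generates the numbers 0 until upTo
class NumbersRDD(sc: SparkContext, upTo: Int, slices: Int) extends RDD[Int](sc, Nil) {

  // The Partitions contract: describe how the dataset is sliced
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](slices) { i =>
      new RangePartition(i, i * upTo / slices, (i + 1) * upTo / slices)
    }

  // The Computing Partition contract: produce the values of a single partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}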
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/RDD/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CoalescedRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • CoGroupedRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • HadoopRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapPartitionsRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NewHadoopRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ParallelCollectionRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ReliableCheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffledRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • others
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rdd/RDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext
• Dependencies (Parent RDDs that should be computed successfully before this RDD)

Abstract Class

RDD is an abstract class and cannot be created directly. It is created indirectly through its concrete implementations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#barrier-rdd","title":"Barrier RDD","text":"

Barrier RDD is an RDD with the isBarrier flag enabled.

ShuffledRDD can never be a barrier RDD as it overrides the isBarrier method to always be disabled (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#isBarrier","title":"isBarrier","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isBarrier(): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isBarrier is the value of isBarrier_.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isBarrier is used when:

• DAGScheduler is requested to submitMissingTasks (for either a ShuffleMapStage to create ShuffleMapTasks or a ResultStage to create ResultTasks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDDInfo is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ShuffleDependency is requested to canShuffleMergeBeEnabled
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to checkBarrierStageWithRDDChainPattern, checkBarrierStageWithDynamicAllocation, checkBarrierStageWithNumSlots, handleTaskCompletion (FetchFailed case to mark a map stage as broken)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#isBarrier_","title":"isBarrier_","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isBarrier_ : Boolean // (1)!\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. @transient protected lazy val

isBarrier_ is enabled (true) when there is at least one barrier RDD among the parent RDDs (excluding ShuffleDependencies).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

isBarrier_ is overridden by PythonRDD and MapPartitionsRDD, which both accept an isFromBarrier flag.
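As a usage-level illustration (assuming an existing rdd), RDD.barrier followed by RDDBarrier.mapPartitions produces such a MapPartitionsRDD with the barrier flag enabled:

import org.apache.spark.BarrierTaskContext

// Assuming an existing rdd: RDD[Int]
val barrierRdd = rdd
  .barrier()                  // RDDBarrier
  .mapPartitions { iter =>    // a MapPartitionsRDD with isFromBarrier enabled
    val ctx = BarrierTaskContext.get()
    ctx.barrier()             // all tasks of the barrier stage wait for one another
    iter
  }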

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#resourceProfile","title":"ResourceProfile (Stage-Level Scheduling)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDD can be assigned a ResourceProfile using RDD.withResources method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              val rdd: RDD[_] = ...\nrdd\n  .withResources(...) // request resources for a computation\n  .mapPartitions(...) // the computation\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDD uses resourceProfile internal registry for the ResourceProfile that is undefined initially.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The ResourceProfile is available using RDD.getResourceProfile method.
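
As a hedged end-to-end sketch (the resource amounts are arbitrary, an active SparkContext sc is assumed, and the cluster manager must support stage-level scheduling for the profile to take effect), a ResourceProfile is typically built with ResourceProfileBuilder and attached with withResources right before the computation it should apply to:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Describe the executor and task resources for the next computation
val executorReqs = new ExecutorResourceRequests().cores(4).memory("4g")
val taskReqs = new TaskResourceRequests().cpus(2)
val rp = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

val rdd = sc.parallelize(1 to 1000)
val result = rdd
  .withResources(rp)            // request resources for the computation below
  .mapPartitions(_.map(_ * 2))  // stand-in for the real per-partition computation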

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#withResources","title":"withResources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withResources(\n  rp: ResourceProfile): this.type\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withResources sets the given ResourceProfile as the resourceProfile and requests the ResourceProfileManager to add the resource profile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#getResourceProfile","title":"getResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourceProfile(): ResourceProfile\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourceProfile returns the resourceProfile (if defined) or null.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourceProfile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested for the ShuffleDependencies and ResourceProfiles of an RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#preferredLocations","title":"Preferred Locations (Placement Preferences of Partition)","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              preferredLocations(\n  split: Partition): Seq[String]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Final Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              preferredLocations is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in the Scala Language Specification.

preferredLocations requests the CheckpointRDD for the preferred locations for the given Partition if this RDD is checkpointed, or falls back to getPreferredLocations otherwise.

preferredLocations is a template method that delegates to getPreferredLocations, which custom RDDs can override to specify their own placement preferences (see the sketch below).

preferredLocations is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested for preferred locations
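
A minimal sketch of such an override follows (the HostAwareRDD class, its HostPartition type and the host names are made up for illustration): the custom RDD reports a preferred host per partition via getPreferredLocations, and the DAGScheduler picks it up through preferredLocations.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type that carries a preferred host
case class HostPartition(index: Int, host: String) extends Partition

// Hypothetical custom RDD that pins every partition to a host
class HostAwareRDD(sc: SparkContext, hosts: Seq[String])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (host, i) => HostPartition(i, host) }.toArray[Partition]

  // Consulted by DAGScheduler through the final preferredLocations method
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(s"partition ${split.index} prefers ${split.asInstanceOf[HostPartition].host}")
}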
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#partitions","title":"Partitions","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              partitions: Array[Partition]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Final Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              partitions is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in the Scala Language Specification.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              partitions requests the CheckpointRDD for the partitions if this RDD is checkpointed.

Otherwise, when this RDD is not checkpointed, partitions uses getPartitions (and caches the result in the partitions_ internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getPartitions is an abstract method that custom RDDs are required to provide.

Partitions have the property that their internal index is equal to their position in this RDD (see the quick check after the list below).

partitions is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to getPreferredLocsInternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to run a job
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • others
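
As a quick illustrative check of the index-equals-position property (assuming an active SparkContext sc):

// The index of every Partition matches its position in the partitions array
val rdd = sc.parallelize(1 to 100, numSlices = 8)
assert(rdd.partitions.length == 8)
assert(rdd.partitions.zipWithIndex.forall { case (p, i) => p.index == i })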
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#dependencies","title":"dependencies","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              dependencies: Seq[Dependency[_]]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Final Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              dependencies is a Scala final method and may not be overridden in subclasses.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Learn more in the Scala Language Specification.

dependencies branches off based on the availability of the checkpointRDD (CheckpointRDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              With CheckpointRDD available (this RDD is checkpointed), dependencies returns a OneToOneDependency with the CheckpointRDD.

Otherwise, when this RDD is not checkpointed, dependencies uses getDependencies (and caches the result in the dependencies_ internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDependencies is an abstract method that custom RDDs are required to provide.
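
For illustration (assuming an active SparkContext sc), the dependency types of a small lineage can be inspected directly; narrow transformations contribute a OneToOneDependency while reduceByKey introduces a ShuffleDependency:

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // no parent dependencies
val mapped  = pairs.mapValues(_ + 1)                             // OneToOneDependency on pairs
val reduced = mapped.reduceByKey(_ + _)                          // ShuffleDependency on mapped

println(pairs.dependencies)    // List()
println(mapped.dependencies)   // a single OneToOneDependency
println(reduced.dependencies)  // a single ShuffleDependency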

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#checkpoint","title":"Reliable Checkpointing","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Public API

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint is part of the public API.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Procedure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint creates a new ReliableRDDCheckpointData (with this RDD) and saves it in checkpointData registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint does nothing when the checkpointData registry has already been defined.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint throws a SparkException when the checkpoint directory is not specified:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Checkpoint directory has not been set in the SparkContext\n
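
A minimal usage sketch (the checkpoint directory path is made up): the checkpoint directory has to be set first, and the RDD is only written out to checkpoint files when the first job runs (see doCheckpoint below).

sc.setCheckpointDir("/tmp/spark-checkpoints")  // hypothetical directory; must be set first

val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.checkpoint()                 // only registers ReliableRDDCheckpointData; nothing is written yet
rdd.count()                      // the first job triggers doCheckpoint and writes the checkpoint files
println(rdd.isCheckpointed)      // true
println(rdd.getCheckpointFile)   // Some(<path under the checkpoint directory>)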
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#checkpointData","title":"RDDCheckpointData","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpointData: Option[RDDCheckpointData[T]]\n

RDD defines the checkpointData internal registry for an RDDCheckpointData[T] (with T being the type of the records of this RDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The checkpointData registry is undefined (None) initially when this RDD is created and can hold a value after the following RDD API operators:

• RDD.checkpoint → ReliableRDDCheckpointData
• RDD.localCheckpoint → LocalRDDCheckpointData

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpointData is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • isCheckpointedAndMaterialized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • isLocallyCheckpointed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • isReliablyCheckpointed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • getCheckpointFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • doCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#checkpointRDD","title":"CheckpointRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpointRDD: Option[CheckpointRDD[T]]\n

checkpointRDD returns the CheckpointRDD of the RDDCheckpointData (if defined, and so only when this RDD is checkpointed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpointRDD is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDD is requested for the dependencies, partitions and preferred locations (all using final methods!)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#doCheckpoint","title":"doCheckpoint","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDD.doCheckpoint, SparkContext.runJob and Dataset.checkpoint

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint is called every time a Spark job is submitted (using SparkContext.runJob).

I found it quite interesting, to say the least.

doCheckpoint is also triggered when the Dataset.checkpoint operator (Spark SQL) is executed (with the eager flag on), which will likely trigger one or more Spark jobs on the underlying RDD anyway.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Procedure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint is a procedure (returns Unit) so what happens inside stays inside (paraphrasing the former advertising slogan of Las Vegas, Nevada).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Does nothing unless checkpointData is defined

My understanding is that doCheckpoint is a noop unless the RDDCheckpointData is defined.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint executes all the following in checkpoint scope.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint turns the doCheckpointCalled flag on (to prevent multiple executions).

doCheckpoint branches off based on whether an RDDCheckpointData is defined or not (see the sketch after this list):

1. With the RDDCheckpointData defined, doCheckpoint first checks the checkpointAllMarkedAncestors flag and, if enabled, requests the RDDs of the Dependencies to doCheckpoint themselves (so that all marked ancestors are checkpointed first). doCheckpoint then requests the RDDCheckpointData to checkpoint this RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2. With the RDDCheckpointData undefined, doCheckpoint requests the Dependencies (of this RDD) for their RDDs that are in turn requested to doCheckpoint themselves (recursively).
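
The branching can be summarized with the following simplified paraphrase of RDD.doCheckpoint (based on the Apache Spark sources, not the verbatim implementation; the names follow the surrounding text):

// Simplified paraphrase; runs in the "checkpoint" RDDOperationScope
private[spark] def doCheckpoint(): Unit = {
  if (!doCheckpointCalled) {
    doCheckpointCalled = true            // prevent multiple executions
    if (checkpointData.isDefined) {
      if (checkpointAllMarkedAncestors) {
        // checkpoint all marked ancestors first (so their lineage is truncated too)
        dependencies.foreach(_.rdd.doCheckpoint())
      }
      checkpointData.get.checkpoint()    // checkpoint this RDD
    } else {
      // not marked for checkpointing itself; recurse into the parent RDDs
      dependencies.foreach(_.rdd.doCheckpoint())
    }
  }
}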

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              With the RDDCheckpointData defined, requesting doCheckpoint of the Dependencies is guarded by checkpointAllMarkedAncestors flag.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint skips execution if called earlier.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              CheckpointRDD

A CheckpointRDD is not checkpointed again (and does nothing when requested to do so).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              doCheckpoint is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to run a job synchronously
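For illustration, here is a spark-shell style sketch (assuming a running SparkContext sc and a writable example directory /tmp/spark-checkpoints) of how an action ends up triggering doCheckpoint and materializing a reliable checkpoint:

    // Checkpointing is requested lazily and materialized by the first action,
    // when SparkContext.runJob calls doCheckpoint on the final RDD.
    sc.setCheckpointDir("/tmp/spark-checkpoints")  // example directory
    val nums = sc.parallelize(1 to 100).map(_ * 2)
    nums.checkpoint()            // marks the RDD for reliable checkpointing
    nums.count()                 // runJob -> doCheckpoint -> checkpoint files written
    println(nums.toDebugString)  // the lineage is now truncated at the checkpoint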
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDD/#iterator","title":"iterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              iterator(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

iterator gets or computes the values of the given partition: with a storage level other than NONE, iterator uses getOrCompute (so the partition may be served from the block manager); otherwise, iterator uses computeOrReadCheckpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Final Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              iterator is a final method and may not be overridden in subclasses. See 5.2.6 final in the Scala Language Specification.
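For illustration, a spark-shell style sketch (assuming a running SparkContext sc) of the two paths iterator can take depending on the RDD's storage level:

    // With a storage level other than NONE, iterator serves partitions via
    // getOrCompute (the block manager); otherwise it uses computeOrReadCheckpoint.
    val nums = sc.parallelize(1 to 100, numSlices = 4).map(_ * 2)
    nums.persist()   // sets a storage level (MEMORY_ONLY by default)
    nums.count()     // first action computes the partitions and caches the blocks
    nums.count()     // subsequent actions read the cached blocks instead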

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#getorcompute","title":"getOrCompute
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getOrCompute(\n  partition: Partition,\n  context: TaskContext): Iterator[T]\n

getOrCompute requests the BlockManager to getOrElseUpdate the partition's RDD block (computing it with computeOrReadCheckpoint when the block is not already available) and returns an iterator over the block's values.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#computeorreadcheckpoint","title":"computeOrReadCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              computeOrReadCheckpoint(\n  split: Partition,\n  context: TaskContext): Iterator[T]\n

computeOrReadCheckpoint computes the values of the given partition (compute) or, when this RDD is checkpointed and materialized, reads them from the first parent RDD (the CheckpointRDD) instead.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#debugging-recursive-dependencies","title":"Debugging Recursive Dependencies
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              toDebugString: String\n

toDebugString returns an RDD Lineage Graph.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              val wordCount = sc.textFile(\"README.md\")\n  .flatMap(_.split(\"\\\\s+\"))\n  .map((_, 1))\n  .reduceByKey(_ + _)\n\nscala> println(wordCount.toDebugString)\n(2) ShuffledRDD[21] at reduceByKey at <console>:24 []\n +-(2) MapPartitionsRDD[20] at map at <console>:24 []\n    |  MapPartitionsRDD[19] at flatMap at <console>:24 []\n    |  README.md MapPartitionsRDD[18] at textFile at <console>:24 []\n    |  README.md HadoopRDD[17] at textFile at <console>:24 []\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              toDebugString uses indentations to indicate a shuffle boundary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The numbers in round brackets show the level of parallelism at each stage, e.g. (2) in the above output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              scala> println(wordCount.getNumPartitions)\n2\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              With spark.logLineage enabled, toDebugString is printed out when executing an action.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              $ ./bin/spark-shell --conf spark.logLineage=true\n\nscala> sc.textFile(\"README.md\", 4).count\n...\n15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25\n15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:\n(4) MapPartitionsRDD[1] at textFile at <console>:25 []\n |  README.md HadoopRDD[0] at textFile at <console>:25 []\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#coalesce","title":"coalesce
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              coalesce(\n  numPartitions: Int,\n  shuffle: Boolean = false,\n  partitionCoalescer: Option[PartitionCoalescer] = Option.empty)\n  (implicit ord: Ordering[T] = null): RDD[T]\n

coalesce returns a new RDD with (at most) the given number of partitions. With shuffle disabled (the default), coalesce creates a CoalescedRDD over this RDD; with shuffle enabled, coalesce first redistributes the elements across numPartitions partitions and coalesces the resulting shuffled RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              coalesce is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDD.repartition high-level operator is used
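A spark-shell style sketch (assuming a running SparkContext sc) of how coalesce and repartition change the number of partitions:

    val nums = sc.parallelize(1 to 100, numSlices = 8)
    println(nums.getNumPartitions)                              // 8
    println(nums.coalesce(2).getNumPartitions)                  // 2 (no shuffle)
    println(nums.coalesce(2, shuffle = true).getNumPartitions)  // 2 (with a shuffle)
    println(nums.repartition(16).getNumPartitions)              // 16 (repartition is coalesce with shuffle enabled)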
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#implicit-methods","title":"Implicit Methods","text":""},{"location":"rdd/RDD/#rddtoorderedrddfunctions","title":"rddToOrderedRDDFunctions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              rddToOrderedRDDFunctions[K : Ordering : ClassTag, V: ClassTag](\n  rdd: RDD[(K, V)]): OrderedRDDFunctions[K, V, (K, V)]\n

rddToOrderedRDDFunctions is a Scala implicit method that creates an OrderedRDDFunctions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              rddToOrderedRDDFunctions is used (implicitly) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDD.sortBy
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • PairRDDFunctions.combineByKey
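For illustration, a spark-shell style sketch (assuming a running SparkContext sc) of the implicit conversion at work; sortByKey is defined on OrderedRDDFunctions, yet it can be called directly on an RDD of pairs because the key type has an Ordering:

    // The compiler applies rddToOrderedRDDFunctions implicitly here.
    val pairs = sc.parallelize(Seq(("b", 2), ("c", 3), ("a", 1)))
    val sorted = pairs.sortByKey()
    sorted.collect().foreach(println)  // (a,1), (b,2), (c,3)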
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDD/#withScope","title":"withScope
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withScope[U](\n  body: => U): U\n

withScope executes the given body with RDDOperationScope.withScope using this RDD's SparkContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withScope is used for most (if not all) RDD API operators.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rdd/RDDCheckpointData/","title":"RDDCheckpointData","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDCheckpointData is an abstraction of information related to RDD checkpointing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[implementations]] Available RDDCheckpointDatas

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [cols=\"30,70\",options=\"header\",width=\"100%\"] |=== | RDDCheckpointData | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | rdd:LocalRDDCheckpointData.md[LocalRDDCheckpointData] | [[LocalRDDCheckpointData]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | rdd:ReliableRDDCheckpointData.md[ReliableRDDCheckpointData] | [[ReliableRDDCheckpointData]] Reliable Checkpointing

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[creating-instance]] Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDCheckpointData takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[rdd]] rdd:RDD.md[RDD]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[Serializable]] RDDCheckpointData as Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RDDCheckpointData is java.io.Serializable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[cpState]] States

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[Initialized]] Initialized

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[CheckpointingInProgress]] CheckpointingInProgress

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • [[Checkpointed]] Checkpointed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[checkpoint]] Checkpointing RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDDCheckpointData/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/RDDCheckpointData/#checkpoint-checkpointrddt","title":"checkpoint(): CheckpointRDD[T]","text":"

checkpoint changes the state (cpState) to CheckpointingInProgress only when in Initialized state. Otherwise, checkpoint does nothing and returns.

checkpoint doCheckpoint that gives a CheckpointRDD (that becomes the cpRDD internal registry).

checkpoint changes the state (cpState) to Checkpointed.

In the end, checkpoint requests the RDD to rdd:RDD.md#markCheckpointed[markCheckpointed].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              checkpoint is used when RDD is requested to rdd:RDD.md#doCheckpoint[doCheckpoint].
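The following is a minimal, self-contained sketch of the state machine described above (CheckpointStateDemo and Data are made-up names for illustration only; this is not the Spark source):

    // Hypothetical model: Initialized -> CheckpointingInProgress -> Checkpointed,
    // with only the first checkpoint() call doing any work.
    object CheckpointStateDemo {
      sealed trait State
      case object Initialized extends State
      case object CheckpointingInProgress extends State
      case object Checkpointed extends State

      final class Data[T](doCheckpoint: () => T) {
        private var cpState: State = Initialized
        private var cpRDD: Option[T] = None

        def checkpoint(): Unit = synchronized {
          if (cpState == Initialized) {
            cpState = CheckpointingInProgress
            cpRDD = Some(doCheckpoint())   // materialize the checkpoint data
            cpState = Checkpointed
          } // otherwise: already (being) checkpointed, so do nothing
        }

        def checkpointRDD: Option[T] = synchronized(cpRDD)
      }

      def main(args: Array[String]): Unit = {
        val data = new Data(() => "checkpoint-rdd")
        data.checkpoint()
        data.checkpoint()            // no-op: state is already Checkpointed
        println(data.checkpointRDD)  // Some(checkpoint-rdd)
      }
    }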

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[doCheckpoint]] doCheckpoint Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RDDCheckpointData/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/RDDCheckpointData/#docheckpoint-checkpointrddt","title":"doCheckpoint(): CheckpointRDD[T]","text":"

doCheckpoint is used when RDDCheckpointData is requested to checkpoint."},{"location":"rdd/RDDOperationScope/","title":"RDDOperationScope","text":""},{"location":"rdd/RDDOperationScope/#withScope","title":"withScope","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withScope[T](\n  sc: SparkContext,\n  name: String,\n  allowNesting: Boolean,\n  ignoreParent: Boolean)(\n  body: => T): T\nwithScope[T](\n  sc: SparkContext,\n  allowNesting: Boolean = false)(\n  body: => T): T\n
The name argument can take the following values depending on the caller:

• checkpoint: RDD.doCheckpoint
• Some method name: executed without an explicit name
• The name of a physical operator (with no Exec suffix): SparkPlan.executeQuery (Spark SQL)

withScope executes the given body in an RDDOperationScope (recorded as the spark.rdd.scope local property of the SparkContext and restored afterwards), so all the RDDs created in the body belong to the same operation scope (used by the web UI's DAG visualization). A simplified sketch follows the list below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              withScope is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDD is requested to doCheckpoint and withScope (for most, if not all, RDD API operators)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to withScope (for most, if not all, SparkContext API operators)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkPlan (Spark SQL) is requested to executeQuery
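A minimal, self-contained sketch of this scoping idea (ScopeDemo, Scope, current and noOverride are made-up illustration names; Spark actually tracks the current scope and a no-override flag as SparkContext local properties, and ignoreParent is not modeled here):

    // Hypothetical model: an outer withScope with allowNesting = false
    // prevents nested withScope calls from replacing its scope.
    object ScopeDemo {
      final case class Scope(name: String, parent: Option[Scope])

      private var current: Option[Scope] = None
      private var noOverride: Boolean = false

      def withScope[T](name: String, allowNesting: Boolean = false)(body: => T): T = {
        val (prevScope, prevNoOverride) = (current, noOverride)
        if (!noOverride) current = Some(Scope(name, prevScope))  // child of the outer scope
        if (!allowNesting) noOverride = true                     // freeze the scope for the body
        try body
        finally { current = prevScope; noOverride = prevNoOverride }
      }

      def main(args: Array[String]): Unit = {
        withScope("map") {
          println(current)                         // Some(Scope(map,None))
          withScope("inner") { println(current) }  // still Some(Scope(map,None))
        }
        withScope("outer", allowNesting = true) {
          withScope("inner") { println(current) }  // Some(Scope(inner,Some(Scope(outer,None))))
        }
      }
    }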
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RangePartitioner/","title":"RangePartitioner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RangePartitioner is a Partitioner that partitions sortable records by range into roughly equal ranges (that can be used for bucketed partitioning).

RangePartitioner is used for the sortByKey operator (mostly).
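For illustration, a spark-shell style sketch (assuming a running SparkContext sc); sortByKey installs a RangePartitioner behind the scenes, and a RangePartitioner can also be passed to partitionBy explicitly:

    import org.apache.spark.RangePartitioner

    val pairs = sc.parallelize(Seq((30, "c"), (10, "a"), (20, "b")), numSlices = 2)
    println(pairs.sortByKey().partitioner)  // Some(org.apache.spark.RangePartitioner@...)

    val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
    println(ranged.getNumPartitions)        // 2 (may be fewer for tiny datasets)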

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rdd/RangePartitioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RangePartitioner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Hint for the number of partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Key-Value RDD (RDD[_ <: Product2[K, V]])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ascending flag (default: true)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • samplePointsPerPartitionHint (default: 20)"},{"location":"rdd/RangePartitioner/#number-of-partitions","title":"Number of Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                numPartitions: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                numPartitions\u00a0is part of the Partitioner abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                numPartitions is 1 more than the length of the range bounds (since the number of range bounds is 0 for 0 or 1 partitions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rdd/RangePartitioner/#partition-for-key","title":"Partition for Key
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getPartition(\n  key: Any): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getPartition\u00a0is part of the Partitioner abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getPartition branches off based on the length of the range bounds.

For up to 128 range bounds, getPartition performs a linear scan. It starts with candidate partition number 0 and walks over the rangeBounds, incrementing the candidate partition number as long as the given key is greater than the current range bound. The result is the index of the first range bound that is not smaller than the key, or the number of range bounds when the key is greater than all of them.

For more than 128 range bounds, getPartition...FIXME

In the end, getPartition returns the candidate partition number when ascending order is used, or flips it (to the number of the rangeBounds minus the candidate partition number) otherwise.
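A minimal standalone sketch of the linear scan described above (a simplified re-implementation, not Spark's actual code; the method name and the implicit Ordering are assumptions):

// Simplified linear scan used for up to 128 range bounds
def partitionForKey[K](key: K, rangeBounds: Array[K], ascending: Boolean)(
    implicit ordering: Ordering[K]): Int = {
  var partition = 0
  // Walk the bounds while the key is greater than the current bound
  while (partition < rangeBounds.length && ordering.gt(key, rangeBounds(partition))) {
    partition += 1
  }
  // Flip the candidate partition number for descending order
  if (ascending) partition else rangeBounds.length - partition
}

partitionForKey(key = 42, rangeBounds = Array(10, 20, 50, 80), ascending = true)  // 2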

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rdd/RangePartitioner/#range-bounds","title":"Range Bounds
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                rangeBounds: Array[K]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                rangeBounds is an array of upper bounds.

For at most one partition, rangeBounds is an empty array.

For more than one partition, rangeBounds determines the sample size per partition. The total sample size is samplePointsPerPartitionHint multiplied by the number of partitions, capped at 1e6. rangeBounds allows for a 3x over-sample per partition (as in the sketch below).
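A back-of-the-envelope sketch of that arithmetic (the variable names and the number of input partitions are assumptions for illustration only):

// Requested (post-shuffle) partitions and the per-partition sampling hint
val partitions = 200
val samplePointsPerPartitionHint = 20
// Total sample size, capped at 1e6
val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
// 3x over-sample per partition of the input RDD to tolerate skewed partitions
val inputPartitions = 50  // number of partitions of the input rdd (assumed)
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / inputPartitions).toInt
// sampleSize = 4000.0, sampleSizePerPartition = 240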

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                rangeBounds sketches the keys of the input rdd (with the sampleSizePerPartition).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                There is more going on in rangeBounds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In the end, rangeBounds determines the bounds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rdd/RangePartitioner/#determinebounds","title":"determineBounds
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                determineBounds[K: Ordering](\n  candidates: ArrayBuffer[(K, Float)],\n  partitions: Int): Array[K]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                determineBounds...FIXME
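Until the above is filled in, here is a hedged, simplified sketch of what determining bounds from weighted candidate keys can look like (not Spark's actual implementation): sort the candidates by key and emit an upper bound every time the accumulated weight crosses an equal-weight step.

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

// Simplified: pick (partitions - 1) upper bounds from weighted candidate keys
def determineBoundsSketch[K: Ordering: ClassTag](
    candidates: ArrayBuffer[(K, Float)],
    partitions: Int): Array[K] = {
  val ordered = candidates.sortBy(_._1)
  val step = ordered.map(_._2.toDouble).sum / partitions
  val bounds = ArrayBuffer.empty[K]
  var cumWeight = 0.0
  var target = step
  for ((key, weight) <- ordered if bounds.length < partitions - 1) {
    cumWeight += weight
    if (cumWeight >= target) {
      bounds += key     // emit an upper bound when the weight step is crossed
      target += step
    }
  }
  bounds.toArray
}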

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rdd/ReliableCheckpointRDD/","title":"ReliableCheckpointRDD","text":"

ReliableCheckpointRDD is a CheckpointRDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rdd/ReliableCheckpointRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ReliableCheckpointRDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[sc]] SparkContext.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • [[checkpointPath]] Checkpoint Directory (on a Hadoop DFS-compatible file system)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • <<_partitioner, Partitioner>>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ReliableCheckpointRDD is created when:

• ReliableCheckpointRDD utility is used to <<writeRDDToCheckpointDirectory, write an RDD to a checkpoint directory>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is requested to SparkContext.md#checkpointFile[checkpointFile]

== [[checkpointPartitionerFileName]] Checkpointed Partitioner File

ReliableCheckpointRDD uses _partitioner as the name of the file in the <<checkpointPath, checkpoint directory>> that the <<partitioner, Partitioner>> is serialized to.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[partitioner]] Partitioner

ReliableCheckpointRDD can be given a rdd:Partitioner.md[Partitioner] when created.

When rdd:RDD.md#partitioner[requested for the Partitioner] (as an RDD), ReliableCheckpointRDD returns the one it was created with or <<readCheckpointedPartitionerFile, reads it back from the checkpointed partitioner file>>.
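A hedged sketch of that fallback (the names are illustrative, not the actual field names):

import org.apache.spark.Partitioner

// Prefer the Partitioner given at creation time; otherwise try to read the
// one serialized into the checkpointed partitioner file (see below)
def partitionerSketch(
    givenPartitioner: Option[Partitioner],
    readFromCheckpointDir: () => Option[Partitioner]): Option[Partitioner] =
  givenPartitioner.orElse(readFromCheckpointDir())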

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[writeRDDToCheckpointDirectory]] Writing RDD to Checkpoint Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#source-scala","title":"[source, scala]","text":"

writeRDDToCheckpointDirectory[T: ClassTag](
  originalRDD: RDD[T],
  checkpointDir: String,
  blockSize: Int = -1): ReliableCheckpointRDD[T]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  writeRDDToCheckpointDirectory...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  writeRDDToCheckpointDirectory is used when ReliableRDDCheckpointData is requested to rdd:ReliableRDDCheckpointData.md#doCheckpoint[doCheckpoint].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[writePartitionerToCheckpointDir]] Writing Partitioner to Checkpoint Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourcescala","title":"[source,scala]","text":"

writePartitionerToCheckpointDir(
  sc: SparkContext,
  partitioner: Partitioner,
  checkpointDirPath: Path): Unit

writePartitionerToCheckpointDir creates the <<checkpointPartitionerFileName, checkpointed partitioner file>> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  writePartitionerToCheckpointDir requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].

writePartitionerToCheckpointDir requests the SerializerInstance to serializer:SerializerInstance.md#serializeStream[open a serialization stream over the output stream] and serializer:SerializationStream.md#writeObject[writes] the given Partitioner to it.
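A minimal sketch of that write path (assumptions: a Hadoop FileSystem fs, a Path partitionerFilePath, and a SparkContext sc; not the actual code):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{Partitioner, SparkContext, SparkEnv}

def writePartitionerSketch(
    sc: SparkContext,
    fs: FileSystem,
    partitionerFilePath: Path,
    partitioner: Partitioner): Unit = {
  // Buffer size from spark.buffer.size (64k assumed as the fallback)
  val bufferSize = sc.getConf.getInt("spark.buffer.size", 65536)
  val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
  // Default Serializer -> new SerializerInstance -> SerializationStream
  val serializeStream = SparkEnv.get.serializer.newInstance().serializeStream(fileOutputStream)
  try {
    serializeStream.writeObject(partitioner)
  } finally {
    serializeStream.close()
  }
}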

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#written-partitioner-to-partitionerfilepath","title":"Written partitioner to [partitionerFilePath]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In case of any non-fatal exception, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#error-writing-partitioner-partitioner-to-checkpointdirpath","title":"Error writing partitioner [partitioner] to [checkpointDirPath]","text":"

writePartitionerToCheckpointDir is used when ReliableCheckpointRDD is requested to <<writeRDDToCheckpointDirectory, write an RDD to a checkpoint directory>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[readCheckpointedPartitionerFile]] Reading Partitioner from Checkpointed Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourcescala_1","title":"[source,scala]","text":"

readCheckpointedPartitionerFile(
  sc: SparkContext,
  checkpointDirPath: String): Option[Partitioner]

readCheckpointedPartitionerFile opens the <<checkpointPartitionerFileName, checkpointed partitioner file>> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readCheckpointedPartitionerFile requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].

readCheckpointedPartitionerFile requests the SerializerInstance to serializer:SerializerInstance.md#deserializeStream[open a deserialization stream over the input stream] and serializer:DeserializationStream.md#readObject[read the Partitioner] back from the partitioner file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readCheckpointedPartitionerFile prints out the following DEBUG message to the logs and returns the partitioner.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_2","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#read-partitioner-from-partitionerfilepath","title":"Read partitioner from [partitionerFilePath]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In case of FileNotFoundException or any non-fatal exceptions, readCheckpointedPartitionerFile prints out a corresponding message to the logs and returns None.
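A mirror-image sketch of the read path, including the fallback to None (assumed names fs and partitionerFilePath; not the actual code):

import java.io.FileNotFoundException
import scala.util.control.NonFatal
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{Partitioner, SparkContext, SparkEnv}

def readPartitionerSketch(
    sc: SparkContext,
    fs: FileSystem,
    partitionerFilePath: Path): Option[Partitioner] = {
  val bufferSize = sc.getConf.getInt("spark.buffer.size", 65536)
  try {
    val fileInputStream = fs.open(partitionerFilePath, bufferSize)
    val deserializeStream = SparkEnv.get.serializer.newInstance().deserializeStream(fileInputStream)
    try {
      Some(deserializeStream.readObject[Partitioner]())
    } finally {
      deserializeStream.close()
    }
  } catch {
    case _: FileNotFoundException => None  // no partitioner was checkpointed
    case NonFatal(_) => None               // corrupt or unreadable file
  }
}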

readCheckpointedPartitionerFile is used when ReliableCheckpointRDD is requested for the <<partitioner, Partitioner>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.rdd.ReliableCheckpointRDD$ logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_3","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#log4jloggerorgapachesparkrddreliablecheckpointrddall","title":"log4j.logger.org.apache.spark.rdd.ReliableCheckpointRDD$=ALL","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableRDDCheckpointData/","title":"ReliableRDDCheckpointData","text":"

ReliableRDDCheckpointData is an RDDCheckpointData for Reliable Checkpointing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableRDDCheckpointData/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ReliableRDDCheckpointData takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[rdd]] rdd:RDD.md[++RDD[T]++]

ReliableRDDCheckpointData is created for the rdd:RDD.md#checkpoint[RDD.checkpoint] operator.
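For example (a usage sketch assuming a running SparkContext sc; the checkpoint directory path is an assumption):

// Reliable checkpointing requires an application-wide checkpoint directory
sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")  // assumed path
val nums = sc.parallelize(1 to 1000)
nums.checkpoint()  // registers a ReliableRDDCheckpointData for `nums`
nums.count()       // the first action triggers doCheckpoint (see below)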

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[cpDir]][[checkpointPath]] Checkpoint Directory

ReliableRDDCheckpointData creates a subdirectory of the SparkContext.md#checkpointDir[application-wide checkpoint directory] for <<doCheckpoint, checkpointing>> the given <<rdd, RDD>>.

The name of the subdirectory uses the rdd:RDD.md#id[unique identifier] of the <<rdd, RDD>>:"},{"location":"rdd/ReliableRDDCheckpointData/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#rdd-id","title":"rdd-[id]","text":"
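A hedged sketch of that layout (an illustrative helper, not the actual method):

import org.apache.hadoop.fs.Path

// Subdirectory of the application-wide checkpoint directory for one RDD
def rddCheckpointPath(checkpointDir: String, rddId: Int): Path =
  new Path(checkpointDir, s"rdd-$rddId")

rddCheckpointPath("hdfs://namenode:8020/tmp/spark-checkpoints", 42)
// => hdfs://namenode:8020/tmp/spark-checkpoints/rdd-42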

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[doCheckpoint]] Checkpointing RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableRDDCheckpointData/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#docheckpoint-checkpointrddt","title":"doCheckpoint(): CheckpointRDD[T]","text":"

doCheckpoint rdd:ReliableCheckpointRDD.md#writeRDDToCheckpointDirectory[writes] the <<rdd, RDD>> to the <<checkpointPath, checkpoint directory>> (which creates a new ReliableCheckpointRDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  With configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled, doCheckpoint requests the SparkContext.md#cleaner[ContextCleaner] to core:ContextCleaner.md#registerRDDCheckpointDataForCleanup[registerRDDCheckpointDataForCleanup] for the new RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, doCheckpoint prints out the following INFO message to the logs and returns the new RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ReliableRDDCheckpointData/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#done-checkpointing-rdd-id-to-cpdir-new-parent-is-rdd-id","title":"Done checkpointing RDD [id] to [cpDir], new parent is RDD [id]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  doCheckpoint is part of the rdd:RDDCheckpointData.md#doCheckpoint[RDDCheckpointData] abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rdd/ShuffleDependency/","title":"ShuffleDependency","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ShuffleDependency is a Dependency on the output of a ShuffleMapStage of a key-value RDD.

ShuffleDependency uses the RDD to know the number of (map-side/pre-shuffle) partitions and the Partitioner for the number of (reduce-side/post-shuffle) partitions.
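For illustration, a usage sketch (assuming a running SparkContext sc): key-based transformations such as reduceByKey create a ShuffleDependency under the covers.

import org.apache.spark.{HashPartitioner, ShuffleDependency}

val words = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1))
val counts = words.reduceByKey(new HashPartitioner(4), _ + _)
// The shuffled RDD depends on its parent through a ShuffleDependency
val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner.numPartitions)  // 4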

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ShuffleDependency takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDD (RDD[_ <: Product2[K, V]])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Partitioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Serializer (default: SparkEnv.get.serializer)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Optional Key Ordering (default: undefined)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Optional Aggregator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • mapSideCombine
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleWriteProcessor
ShuffleDependency is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CoGroupedRDD is requested for the dependencies (for RDDs with different partitioners)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffledRDD is requested for the dependencies
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExchangeExec (Spark SQL) physical operator is requested to prepare a ShuffleDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, ShuffleDependency gets the shuffle id.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency registers itself with the ShuffleManager and gets a ShuffleHandle (available as shuffleHandle). ShuffleDependency uses SparkEnv to access the ShuffleManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, ShuffleDependency registers itself with the ContextCleaner (if configured) and the ShuffleDriverComponents.
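
For illustration, here is a minimal spark-shell sketch (assuming a running SparkContext available as sc); the ShuffleDependency created behind groupByKey can be inspected through the RDD's dependencies:

import org.apache.spark.ShuffleDependency\n// groupByKey needs a shuffle, so the resulting RDD has a single ShuffleDependency\nval grouped = sc.parallelize(Seq((1, 10), (2, 20), (1, 30))).groupByKey()\nval dep = grouped.dependencies.head.asInstanceOf[ShuffleDependency[Int, Int, _]]\ndep.shuffleId      // application-wide shuffle ID assigned at creation time\ndep.shuffleHandle  // ShuffleHandle obtained from the ShuffleManager at creation time\n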

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#aggregator","title":"Aggregator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    aggregator: Option[Aggregator[K, V, C]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency can be given a map/reduce-side Aggregator when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency asserts (when created) that an Aggregator is defined when the mapSideCombine flag is enabled.

aggregator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleWriter is requested to write records (for mapper tasks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockStoreShuffleReader is requested to read records (for reducer tasks)
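
For example (a spark-shell sketch assuming a running SparkContext sc), reduceByKey defines an Aggregator on its ShuffleDependency while partitionBy does not:

import org.apache.spark.{HashPartitioner, ShuffleDependency}\nimport org.apache.spark.rdd.RDD\nval pairs = sc.parallelize(Seq((1, 10), (2, 20), (1, 30)))\ndef shuffleDep(rdd: RDD[_]) = rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]\nshuffleDep(pairs.reduceByKey(_ + _)).aggregator.isDefined                   // true: values are aggregated\nshuffleDep(pairs.partitionBy(new HashPartitioner(4))).aggregator.isDefined  // false: records are only repartitioned\n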
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#map-size-partial-aggregation-flag","title":"Map-Size Partial Aggregation Flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency uses a mapSideCombine flag that controls whether to perform map-side partial aggregation (map-side combine) using the Aggregator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    mapSideCombine is disabled (false) by default and can be enabled (true) for some uses of ShuffledRDD.

ShuffleDependency requires that the optional Aggregator is actually defined when the flag is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    mapSideCombine is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleManager is requested to register a shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleWriter is requested to write records
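
For example (spark-shell, assuming sc), reduceByKey enables map-side combine on its ShuffleDependency while groupByKey keeps it disabled:

import org.apache.spark.ShuffleDependency\nval pairs = sc.parallelize(Seq((1, 10), (2, 20), (1, 30)))\n// reduceByKey combines values on the map side before shuffling\npairs.reduceByKey(_ + _).dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].mapSideCombine  // true\n// groupByKey has to ship every value, so map-side combine stays disabled\npairs.groupByKey().dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].mapSideCombine        // false\n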
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#partitioner","title":"Partitioner

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency is given a Partitioner (when created).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency uses the Partitioner to partition the shuffle output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The Partitioner is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleWriter is requested to write records (and create an ExternalSorter)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • others (FIXME)
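
For example (spark-shell, assuming sc), the numPartitions of the Partitioner determines the number of post-shuffle partitions:

import org.apache.spark.ShuffleDependency\nval counts = sc.parallelize(Seq((1, 10), (2, 20), (1, 30))).reduceByKey(_ + _, 8)  // 8 = numPartitions\nval dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]]\ndep.partitioner.numPartitions  // 8\ncounts.getNumPartitions        // 8, one partition per reduce-side task\n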
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shufflewriteprocessor","title":"ShuffleWriteProcessor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency can be given a ShuffleWriteProcessor when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The ShuffleWriteProcessor is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleMapTask is requested to runTask (to write partition records out to the shuffle system)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shuffle-id","title":"Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleId: Int\n

ShuffleDependency is uniquely identified by an application-wide shuffle ID (requested from SparkContext when the ShuffleDependency is created).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffleDependency/#shufflehandle","title":"ShuffleHandle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDependency registers itself with the ShuffleManager when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The ShuffleHandle is used when:

• CoGroupedRDD, ShuffledRDD, and ShuffledRowRDD (Spark SQL) are requested to compute a partition (to get a ShuffleReader for a ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleMapTask is requested to run (to get a ShuffleWriter for a ShuffleDependency).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/","title":"ShuffledRDD","text":"

ShuffledRDD is an RDD of key-value pairs that represents a shuffle step in an RDD lineage (and indicates the start of a new stage).

When requested to compute a partition, ShuffledRDD uses the ShuffleHandle of its one and only ShuffleDependency to request a ShuffleReader (from the system ShuffleManager) that reads the (combined) key-value pairs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffledRDD takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RDD (of K keys and V values)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Partitioner
ShuffledRDD is created for the following RDD operators:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • OrderedRDDFunctions.sortByKey and OrderedRDDFunctions.repartitionAndSortWithinPartitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • PairRDDFunctions.combineByKeyWithClassTag and PairRDDFunctions.partitionBy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RDD.coalesce (with shuffle flag enabled)
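
A quick spark-shell check (assuming sc) that these operators indeed produce a ShuffledRDD, directly or in their lineage:

import org.apache.spark.HashPartitioner\nval pairs = sc.parallelize(Seq((3, 30), (1, 10), (2, 20)))\npairs.sortByKey().getClass.getSimpleName                              // ShuffledRDD\npairs.partitionBy(new HashPartitioner(4)).getClass.getSimpleName      // ShuffledRDD\nsc.parallelize(1 to 10, 8).coalesce(2, shuffle = true).toDebugString  // the lineage includes a ShuffledRDD\n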

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#partitioner","title":"Partitioner

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffledRDD is given a Partitioner when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RangePartitioner for sortByKey
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • HashPartitioner for coalesce
• Whatever is passed in to the following high-level RDD operators when different from the current Partitioner (of the RDD):
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • repartitionAndSortWithinPartitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • combineByKeyWithClassTag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • partitionBy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The given Partitioner is the partitioner of this ShuffledRDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The Partitioner is also used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • getDependencies (to create the only ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • getPartitions (to create as many ShuffledRDDPartitions as the numPartitions of the Partitioner)
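
For example (spark-shell, assuming sc), the Partitioner passed to partitionBy becomes the partitioner of the resulting ShuffledRDD:

import org.apache.spark.HashPartitioner\nval part = new HashPartitioner(4)\nval byKey = sc.parallelize(Seq((1, 10), (2, 20))).partitionBy(part)\nbyKey.partitioner       // Some(part), i.e. exactly the Partitioner given to partitionBy\nbyKey.getNumPartitions  // 4, one ShuffledRDDPartition per partition of the Partitioner\n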
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#dependencies","title":"Dependencies Signature
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getDependencies: Seq[Dependency[_]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getDependencies is part of the RDD abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getDependencies uses the user-specified Serializer, if defined, or requests the current SerializerManager for one.

getDependencies uses the mapSideCombine internal flag to choose the key and value types for the Serializer (i.e. K and C when the flag is enabled, or K and V otherwise).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, getDependencies creates a single ShuffleDependency (with the previous RDD, the Partitioner, and the Serializer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#computing-partition","title":"Computing Partition Signature
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      compute(\n  split: Partition,\n  context: TaskContext): Iterator[(K, C)]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      compute is part of the RDD abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      compute assumes that ShuffleDependency is the first dependency among the dependencies (and the only one per getDependencies).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      compute uses the SparkEnv to access the ShuffleManager. compute requests the ShuffleManager for the ShuffleReader based on the following:

• ShuffleHandle: the ShuffleHandle of the ShuffleDependency
• startPartition: the index of the given split partition
• endPartition: the split partition index + 1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, compute requests the ShuffleReader to read the (combined) key-value pairs (of type (K, C)).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#key-value-and-combiner-types","title":"Key, Value and Combiner Types
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      class ShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag]\n

ShuffledRDD is given an RDD of K keys and V values when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When computed, ShuffledRDD produces pairs of K keys and C values.
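
For example (spark-shell, assuming sc), combineByKey shuffles (Int, Int) records and produces (Int, List[Int]) pairs, so K = Int, V = Int and C = List[Int]:

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))\nval combined = pairs.combineByKey(\n  (v: Int) => List(v),                          // createCombiner: V => C\n  (c: List[Int], v: Int) => v :: c,             // mergeValue: (C, V) => C\n  (c1: List[Int], c2: List[Int]) => c1 ::: c2)  // mergeCombiners: (C, C) => C\ncombined.collect()  // e.g. Array((1,List(10, 20)), (2,List(30))); element order may vary\n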

isBarrier Flag

ShuffledRDD has the isBarrier flag always disabled (false).

Map-Side Combine Flag

ShuffledRDD uses a map-side combine flag to create its ShuffleDependency when requested for the dependencies (there is always exactly one dependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The flag is disabled (false) by default and can be changed using setMapSideCombine method.

setMapSideCombine(
  mapSideCombine: Boolean): ShuffledRDD[K, V, C]

setMapSideCombine is used by the PairRDDFunctions.combineByKeyWithClassTag transformation (which enables the flag by default).
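As a quick, hedged illustration (assuming a local sc): groupByKey turns map-side combine off while reduceByKey keeps it on, which is visible on the ShuffleDependency of the resulting RDDs:

import org.apache.spark.ShuffleDependency

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

def mapSideCombine(rdd: org.apache.spark.rdd.RDD[_]): Boolean =
  rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].mapSideCombine

assert(mapSideCombine(pairs.reduceByKey(_ + _)))  // combineByKeyWithClassTag with the flag enabled
assert(!mapSideCombine(pairs.groupByKey()))       // groupByKey disables map-side combine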

Placement Preferences of Partition
getPreferredLocations(
  partition: Partition): Seq[String]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocations is part of the RDD abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocations requests MapOutputTrackerMaster for the preferred locations of the given partition (BlockManagers with the most map outputs).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocations uses SparkEnv to access the current MapOutputTrackerMaster.
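A minimal sketch of the above (paraphrased, not the verbatim sources):

// Paraphrased sketch of ShuffledRDD.getPreferredLocations
override protected def getPreferredLocations(partition: Partition): Seq[String] = {
  val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  tracker.getPreferredLocationsForShuffle(dep, partition.index)
}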

ShuffledRDDPartition

ShuffledRDDPartition is given an index when created (which is the index of a partition as computed by the Partitioner).

User-Specified Serializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      User-specified Serializer for the single ShuffleDependency dependency

userSpecifiedSerializer: Option[Serializer] = None

userSpecifiedSerializer is undefined (None) by default and can be changed using the setSerializer method (which is used by the PairRDDFunctions.combineByKeyWithClassTag transformation).
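For illustration (a sketch, assuming a local sc), you could build a ShuffledRDD by hand and force Kryo serialization of the shuffle data:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.serializer.KryoSerializer

val pairs = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))
val shuffled = new ShuffledRDD[Int, String, String](pairs, new HashPartitioner(2))
  .setSerializer(new KryoSerializer(sc.getConf))
shuffled.count  // runs the shuffle with the Kryo serializer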

Demos

ShuffledRDD and coalesce
val data = sc.parallelize(0 to 9)
val coalesced = data.coalesce(numPartitions = 4, shuffle = true)

scala> println(coalesced.toDebugString)
(4) MapPartitionsRDD[9] at coalesce at <pastie>:75 []
 |  CoalescedRDD[8] at coalesce at <pastie>:75 []
 |  ShuffledRDD[7] at coalesce at <pastie>:75 []
 +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []
    |   ParallelCollectionRDD[5] at parallelize at <pastie>:74 []

ShuffledRDD and sortByKey
// Continues the previous demo (coalesced has 4 partitions)
val grouped = coalesced.groupBy(_ % 2)
val sorted = grouped.sortByKey(numPartitions = 2)

scala> println(sorted.toDebugString)
(2) ShuffledRDD[15] at sortByKey at <console>:74 []
 +-(4) ShuffledRDD[12] at groupBy at <console>:74 []
    +-(4) MapPartitionsRDD[11] at groupBy at <console>:74 []
       |  MapPartitionsRDD[9] at coalesce at <pastie>:75 []
       |  CoalescedRDD[8] at coalesce at <pastie>:75 []
       |  ShuffledRDD[7] at coalesce at <pastie>:75 []
       +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []
          |   ParallelCollectionRDD[5] at parallelize at <pastie>:74 []

RDD Checkpointing

RDD Checkpointing is a process of truncating an RDD lineage graph and saving the RDD data to a reliable distributed (e.g. HDFS) or local file system.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      There are two types of checkpointing:

• Reliable Checkpointing - RDD checkpointing that saves the actual intermediate RDD data to a reliable distributed file system (e.g. Hadoop DFS)
• Local Checkpointing - RDD checkpointing that saves the data to a local file system

It's up to a Spark application developer to decide when and how to checkpoint using the RDD.checkpoint() method.

Before checkpointing can be used, a Spark developer has to set the checkpoint directory using the SparkContext.setCheckpointDir(directory: String) method.

Reliable Checkpointing

You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory - the directory where RDDs are checkpointed. The directory must be an HDFS path if running on a cluster. The reason is that the driver may attempt to reconstruct the checkpointed RDD from its own local file system, which is incorrect because the checkpoint files are actually on the executor machines.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.

NOTE: It is strongly recommended that a checkpointed RDD is persisted in memory; otherwise, saving it to a file will require recomputation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When an action is called on a checkpointed RDD, the following INFO message is printed out in the logs:

Done checkpointing RDD 5 to [path], new parent is RDD [id]

Local Checkpointing

localCheckpoint allows truncating the RDD lineage graph while skipping the expensive step of replicating the materialized data to a reliable distributed file system.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This is useful for RDDs with long lineages that need to be truncated periodically, e.g. GraphX.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Local checkpointing trades fault-tolerance for performance.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: The checkpoint directory set through SparkContext.setCheckpointDir is not used.
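A short sketch (assuming a local sc):

val rdd = sc.parallelize(0 to 9).map(n => (n, n * n))
rdd.localCheckpoint()       // no SparkContext.setCheckpointDir needed
rdd.count                   // the first action materializes the checkpoint
assert(rdd.isCheckpointed)
println(rdd.toDebugString)  // the lineage is now truncated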

Demo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        val rdd = sc.parallelize(0 to 9)

scala> rdd.checkpoint
org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
  at org.apache.spark.rdd.RDD.checkpoint(RDD.scala:1599)
  ... 49 elided

sc.setCheckpointDir("/tmp/rdd-checkpoint")

// Creates a subdirectory for this SparkContext
$ ls /tmp/rdd-checkpoint/
fc21e1d1-3cd9-4d51-880f-58d1dd07f783

// Mark the RDD to checkpoint at the earliest action
rdd.checkpoint

scala> println(rdd.getCheckpointFile)
Some(file:/tmp/rdd-checkpoint/fc21e1d1-3cd9-4d51-880f-58d1dd07f783/rdd-2)

scala> println(rdd.id)
2

scala> println(rdd.getNumPartitions)
16

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rdd.count

// Check out the checkpoint directory
// You should find a directory for the checkpointed RDD, e.g. rdd-2
// The number of part-000* files is exactly the number of partitions
$ ls -ltra /tmp/rdd-checkpoint/fc21e1d1-3cd9-4d51-880f-58d1dd07f783/rdd-2/part-000* | wc -l
16

RDD Lineage — Logical Execution Plan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RDD Lineage (RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RDD lineage is built as a result of applying transformations to an RDD and creates a so-called logical execution plan.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The execution DAG or physical execution plan is the DAG of stages.

Such an RDD lineage graph could be the result of the following series of transformations:

val r00 = sc.parallelize(0 to 9)
val r01 = sc.parallelize(0 to 90 by 10)
val r10 = r00.cartesian(r01)
val r11 = r00.map(n => (n, n))
val r12 = r00.zip(r01)
val r13 = r01.keyBy(_ / 20)
val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)

An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
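You can inspect the resulting lineage with toDebugString, e.g. for r20 above (the exact output depends on your environment):

// Each indented block corresponds to a parent RDD in the lineage
println(r20.toDebugString)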

Logical Execution Plan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Logical Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or reference cached data) and ends with the RDD that produces the result of the action that has been called to execute.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Note

A logical plan (a DAG) is materialized and executed when SparkContext is requested to run a Spark job.

Actions

RDD Actions are RDD operations that produce concrete non-RDD values. They materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type but RDD[T] is an action.

action: RDD => a value

NOTE: Actions are synchronous. You can use AsyncRDDActions to release a calling thread while calling actions.

They trigger the execution of RDD transformations to return values. Simply put, an action evaluates the RDD lineage graph.

You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data. A short demo follows the list of actions below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • aggregate
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • collect
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • count
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • countApprox*
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • countByValue*
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • first
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • fold
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • foreach
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • foreachPartition
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • max
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • min
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • reduce
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • saveAs* (e.g. saveAsTextFile, saveAsHadoopFile)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • take
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • takeOrdered
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • takeSample
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • toLocalIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • top
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • treeAggregate
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • treeReduce

Actions run jobs using SparkContext.runJob or DAGScheduler.runJob directly.

[source, scala]
----
scala> :type words

scala> words.count
res0: Long = 502
----
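As an illustration only (not how count is actually implemented), here is a minimal spark-shell sketch of a count-like action expressed directly with SparkContext.runJob: run a function over every partition and combine the per-partition results on the driver.

[source, scala]
----
val nums = sc.parallelize(1 to 100, 4)

// runJob runs the given function over every partition and returns one result per partition
val perPartitionCounts: Array[Long] =
  sc.runJob(nums, (iter: Iterator[Int]) => iter.size.toLong)

// the per-partition results are combined on the driver
val total = perPartitionCounts.sum   // 100
----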

TIP: Cache the RDDs you work with when you want to execute two or more actions on them, for better performance. Refer to spark-rdd-caching.md[RDD Caching and Persistence].

Before running an action, Spark cleans the closure/function (using SparkContext.clean) so it can be serialized and sent over the wire to executors. Cleaning can throw a SparkException if the computation cannot be cleaned.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: Spark uses ClosureCleaner to clean closures.
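A minimal sketch of cleaning gone wrong (assuming a spark-shell session and a deliberately non-serializable helper class defined only for this example):

[source, scala]
----
class Logger {                        // deliberately NOT Serializable
  def log(s: String): Unit = println(s)
}
val logger = new Logger

val nums = sc.parallelize(1 to 10)

// The closure captures `logger`, which cannot be serialized, so the clean step
// rejects it with org.apache.spark.SparkException: Task not serializable.
nums.map { n => logger.log(n.toString); n }.count()
----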

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[AsyncRDDActions]] AsyncRDDActions

AsyncRDDActions class offers asynchronous actions that you can use on RDDs (thanks to the implicit conversion rddToAsyncRDDActions in the RDD class). The methods return a FutureAction (see the sketch after the list of methods below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The following asynchronous methods are available:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • countAsync
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • collectAsync
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • takeAsync
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • foreachAsync
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • foreachPartitionAsync
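A minimal sketch of using an asynchronous action (assuming a spark-shell session, so sc is available); a FutureAction is a scala.concurrent.Future, so you can attach callbacks, await it, or cancel it.

[source, scala]
----
import scala.concurrent.Await
import scala.concurrent.duration._

val nums = sc.parallelize(1 to 100, 4)

// countAsync submits the job and returns immediately with a FutureAction[Long]
val futureCount = nums.countAsync()

// FutureAction is a scala.concurrent.Future, so the usual combinators apply
futureCount.foreach(n => println(s"counted $n elements"))(scala.concurrent.ExecutionContext.global)

// ...or block until the result is available (or cancel it with futureCount.cancel())
val n = Await.result(futureCount, 1.minute)
----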
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-caching/","title":"Caching and Persistence","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == RDD Caching and Persistence

Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or in more durable storage like disk, and can optionally be replicated.

RDDs can be cached using the cache operation. They can also be persisted using the persist operation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.

NOTE: Due to the very small and purely syntactic difference between caching and persistence of RDDs, the two terms are often used interchangeably and I will follow the "pattern" here.

RDDs can also be unpersisted to remove their blocks from storage (memory and/or disk).
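A minimal spark-shell sketch of the three operations (the input file and storage levels are chosen for illustration only):

[source, scala]
----
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("README.md")

lines.cache()                                 // same as lines.persist(StorageLevel.MEMORY_ONLY)

val words = lines.flatMap(_.split("\\s+"))
words.persist(StorageLevel.MEMORY_AND_DISK)   // explicit storage level

words.count()     // the first action computes the RDD and materializes its blocks
words.count()     // subsequent actions reuse the cached blocks

words.unpersist() // remove the blocks from memory and/or disk
lines.unpersist()
----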

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[cache]] Caching RDD -- cache Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-caching/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-caching/#cache-thistype-persist","title":"cache(): this.type = persist()","text":"

cache is a synonym of persist with the storage:StorageLevel.md[MEMORY_ONLY storage level].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[persist]] Persisting RDD -- persist Methods

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-caching/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        persist(): this.type persist(newLevel: StorageLevel): this.type

persist marks an RDD for persistence using the given newLevel storage:StorageLevel.md[storage level].

You can only change the storage level once; otherwise, persist reports an UnsupportedOperationException:

----
Cannot change storage level of an RDD after it was already assigned a level
----

NOTE: You can call persist again on an RDD with an already-assigned storage level only if the new level is the same as the one currently assigned.

If the RDD is marked as persistent for the first time, it is core:ContextCleaner.md#registerRDDForCleanup[registered with the ContextCleaner] (if available) and the SparkContext.md#persistRDD[SparkContext].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The internal storageLevel attribute is set to the input newLevel storage level.
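A sketch of the behaviour described above (assuming a spark-shell session):

[source, scala]
----
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 100)
nums.persist(StorageLevel.MEMORY_ONLY)

// Re-assigning the same level is allowed (effectively a no-op)...
nums.persist(StorageLevel.MEMORY_ONLY)

// ...but changing it throws java.lang.UnsupportedOperationException:
// Cannot change storage level of an RDD after it was already assigned a level
nums.persist(StorageLevel.MEMORY_AND_DISK)
----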

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[unpersist]] Unpersisting RDDs (Clearing Blocks) -- unpersist Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-caching/#source-scala_2","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-caching/#unpersistblocking-boolean-true-thistype","title":"unpersist(blocking: Boolean = true): this.type","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When called, unpersist prints the following INFO message to the logs:

----
INFO [RddName]: Removing RDD [id] from persistence list
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        It then calls SparkContext.md#unpersist[SparkContext.unpersistRDD(id, blocking)] and sets storage:StorageLevel.md[NONE storage level] as the current storage level.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-operations/","title":"Operators","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Operators - Transformations and Actions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RDDs have two types of operations: spark-rdd-transformations.md[transformations] and spark-rdd-actions.md[actions].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: Operators are also called operations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Gotchas - things to watch for

Even if you don't access it explicitly, a SparkContext cannot be referenced inside a closure, because closures are serialized and shipped to executors (and a SparkContext is not serializable). Likewise, RDD transformations and actions cannot be invoked from inside other transformations or actions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        See https://issues.apache.org/jira/browse/SPARK-5063
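A sketch of the gotcha (assuming a spark-shell session): nesting one RDD's action inside another RDD's transformation fails at runtime with a SparkException that refers to SPARK-5063.

[source, scala]
----
val nums  = sc.parallelize(1 to 10)
val other = sc.parallelize(11 to 20)

// The closure captures `other`; executing it on executors fails with a
// SparkException referring to SPARK-5063, because RDD transformations and
// actions can only be invoked by the driver.
val broken = nums.map(n => other.count() + n)
broken.collect()
----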

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-partitions/","title":"Partitions and Partitioning","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == Partitions and Partitioning

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === Introduction

Depending on how you look at Spark (as a programmer, devops engineer, or admin), an RDD is either about its content (the developer's and data scientist's perspective) or about how it gets spread out over a cluster (the performance perspective), i.e. how many partitions the RDD is split into.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        A partition (aka split) is a logical chunk of a large distributed data set.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-partitions/#caution","title":"[CAUTION]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. How does the number of partitions map to the number of tasks? How to verify it?

Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.

There is typically a one-to-one correspondence between how data is laid out in data storage like HDFS or Cassandra and the partitions of an RDD read from it (such storage is partitioned for the same reasons).

Aspects of partitions you may want to control:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • number
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • partitioning scheme
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • node distribution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • repartitioning
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-partitions/#how-does-the-mapping-between-partitions-and-tasks-correspond-to-data-locality-if-any","title":"How does the mapping between partitions and tasks correspond to data locality if any?","text":""},{"location":"rdd/spark-rdd-partitions/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Read the following documentations to learn what experts say on the topic:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html[How Many Partitions Does An RDD Have?]

By default, a partition is created for each HDFS block, which by default is 64MB (from http://spark.apache.org/docs/latest/programming-guide.html#external-datasets[Spark's Programming Guide]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        RDDs get partitioned automatically without programmer intervention. However, there are times when you'd like to adjust the size and number of partitions or the partitioning scheme according to the needs of your application.

Every RDD defines a def getPartitions: Array[Partition] method; in user code you access its result through the partitions method of an RDD to see the set of partitions of that RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        As noted in https://github.com/databricks/spark-knowledgebase/blob/master/performance_optimization/how_many_partitions_does_an_rdd_have.md#view-task-execution-against-partitions-using-the-ui[View Task Execution Against Partitions Using the UI]:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When a stage executes, you can see the number of partitions for a given stage in the Spark UI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Start spark-shell and see it yourself!

[source, scala]
----
scala> sc.parallelize(1 to 100).count
res0: Long = 100
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When you execute the Spark job, i.e. sc.parallelize(1 to 100).count, you should see the following in http://localhost:4040/jobs[Spark shell application UI].

.The number of partitions as Total tasks in the UI
image::spark-partitions-ui-stages.png[align="center"]

The reason for 8 Tasks in Total is that I'm on an 8-core laptop and, by default, the number of partitions is the number of all available cores.

----
$ sysctl -n hw.ncpu
8
----

You can request a minimum number of partitions using the second input parameter to many of the methods that create RDDs.

[source, scala]
----
scala> sc.parallelize(1 to 100, 2).count
res1: Long = 100
----

.Total tasks in the UI shows 2 partitions
image::spark-partitions-ui-stages-2-partitions.png[align="center"]

You can always ask for the number of partitions using the partitions method of an RDD:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> val ints = sc.parallelize(1 to 100, 4)\nints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24\n\nscala> ints.partitions.size\nres2: Int = 4\n

In general, more (and hence smaller) partitions allow work to be distributed among more workers, while fewer (and larger) partitions allow work to be done in bigger chunks with less overhead, which may mean the work finishes more quickly as long as all workers are kept busy.

Increasing the partition count makes each partition hold less data (or possibly none at all!)

Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism[2-3 times that]).

When choosing a \"good\" number of partitions, you generally want at least as many as the number of executors, for parallelism. You can get this computed value by calling sc.defaultParallelism.
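For example, a quick check in spark-shell (the exact value depends on your master setting and cluster; 8 below just mirrors the 8-core laptop used above):

[source, scala]
----
// the default number of partitions Spark uses when none is given explicitly
sc.defaultParallelism                       // e.g. 8 in local mode on an 8-core laptop

// parallelize with no explicit partition count falls back to it
sc.parallelize(1 to 100).partitions.size    // same value as above
----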

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Also, the number of partitions determines how many files get generated by actions that save RDDs to files.
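As a small illustration (the output path /tmp/numbers and the 4 partitions are made up for the example), saving an RDD with 4 partitions produces 4 part files:

[source, scala]
----
// 4 partitions => 4 output part files (part-00000 through part-00003), plus a _SUCCESS marker
sc.parallelize(1 to 100, 4).saveAsTextFile("/tmp/numbers")
----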

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The maximum size of a partition is ultimately limited by the available memory of an executor.

In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partition), the partition parameter carries over to all further transformations and actions on this RDD.
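A minimal sketch of that behaviour, reusing the placeholder HDFS path from the examples below (so the path itself is hypothetical):

[source, scala]
----
val lines = sc.textFile("hdfs://.../file.txt", 4)   // 4 is the minimum number of partitions requested
lines.partitions.size                                // at least 4 for a splittable file
lines.map(_.length).partitions.size                  // the derived RDD keeps the same partition count
----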

Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may cause a shuffle in some situations, but it is not guaranteed to happen in all cases, and (like any transformation) it only actually runs once an action is executed.

When creating an RDD by reading a file using rdd = SparkContext().textFile(\"hdfs://.../file.txt\"), the number of partitions may be lower than you expect. Ideally, you would get the same number of partitions as there are blocks in HDFS, but if the lines in your file are too long (longer than the block size), there will be fewer partitions.

The preferred way to set the number of partitions for an RDD is to pass it directly as the second input parameter, as in rdd = sc.textFile(\"hdfs://.../file.txt\", 400), where 400 is the number of partitions. In this case, the split into 400 partitions is done by Hadoop's TextInputFormat, not Spark, and it works much faster. The code also spawns 400 tasks to load file.txt directly into 400 partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        It will only work as described for uncompressed files.

When using textFile with compressed files (file.txt.gz, not file.txt or similar), Spark disables splitting, which makes for an RDD with only 1 partition (as reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should do <<repartition, repartitioning>>.
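A quick way to confirm this behaviour (the .gz file name is hypothetical):

[source, scala]
----
// gzip is not splittable, so the whole file lands in a single partition
val gz = sc.textFile("file.txt.gz")
gz.partitions.size                       // 1

// repartitioning (a shuffle) spreads the data across more partitions
gz.repartition(8).partitions.size        // 8
----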

Some operations, e.g. map, flatMap, filter, don't preserve partitioning, i.e. the optional Partitioner of the parent RDD is not carried over to the result (the number of partitions, however, stays the same).

map, flatMap and filter are applied partition by partition, i.e. the function is run over the elements of each partition independently, with no data moving between partitions.
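Continuing with the ints RDD created earlier (4 partitions), you can see that these operations keep the partition count:

[source, scala]
----
ints.partitions.size                       // 4, as created above
ints.map(_ * 2).partitions.size            // still 4
ints.filter(_ % 2 == 0).partitions.size    // still 4
----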

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[repartitioning]][[repartition]] Repartitioning RDD -- repartition Transformation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-partitions/#httpssparkapacheorgdocslatesttuninghtmltuning-spark-the-official-documentation-of-spark","title":"https://spark.apache.org/docs/latest/tuning.html[Tuning Spark] (the official documentation of Spark)","text":""},{"location":"rdd/spark-rdd-partitions/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-partitions/#repartitionnumpartitions-intimplicit-ord-orderingt-null-rddt","title":"repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]","text":"

repartition is <<coalesce, coalesce>> with the given numPartitions and shuffle enabled.
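A quick sanity check of that relationship (not the actual implementation, just the observable behaviour):

[source, scala]
----
val nums = sc.parallelize(1 to 100, 2)
nums.repartition(8).partitions.size                 // 8
nums.coalesce(8, shuffle = true).partitions.size    // 8 -- the same effect
----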

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        With the following computation you can see that repartition(5) causes 5 tasks to be started using NODE_LOCAL data locality.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> lines.repartition(5).count\n...\n15/10/07 08:10:00 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (MapPartitionsRDD[19] at repartition at <console>:27)\n15/10/07 08:10:00 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 17, localhost, partition 0,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 18, localhost, partition 1,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 19, localhost, partition 2,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 20, localhost, partition 3,NODE_LOCAL, 2089 bytes)\n15/10/07 08:10:00 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 21, localhost, partition 4,NODE_LOCAL, 2089 bytes)\n...\n

You can see the change after executing repartition(1): it causes 2 tasks to be started using PROCESS_LOCAL data locality.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> lines.repartition(1).count\n...\n15/10/07 08:14:09 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[20] at repartition at <console>:27)\n15/10/07 08:14:09 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks\n15/10/07 08:14:09 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 22, localhost, partition 0,PROCESS_LOCAL, 2058 bytes)\n15/10/07 08:14:09 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 23, localhost, partition 1,PROCESS_LOCAL, 2058 bytes)\n...\n

Please note that Spark disables splitting for compressed files and creates an RDD with only 1 partition. In such cases, it's helpful to use sc.textFile('demo.gz') and do the repartitioning using rdd.repartition(100) as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rdd = sc.textFile('demo.gz')\nrdd = rdd.repartition(100)\n

With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.

* rdd.repartition(N) does a shuffle to split data to match N
** partitioning is done on a round-robin basis (see the check below)
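A small way to inspect the per-partition distribution after a repartition (the input and the exact counts are just an example):

[source, scala]
----
// record count per partition after repartition(4)
sc.parallelize(1 to 1000, 1)
  .repartition(4)
  .mapPartitionsWithIndex { case (idx, it) => Iterator((idx, it.size)) }
  .collect
// e.g. Array((0,250), (1,250), (2,250), (3,250)) -- roughly equal sizes
----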

TIP: If the partitioning scheme doesn't work for you, you can write your own custom partitioner.
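A minimal custom partitioner might look like the following sketch (the ModuloPartitioner name and the hash-modulo scheme are made up for the example; custom partitioners only apply to key-value RDDs, e.g. via partitionBy):

[source, scala]
----
import org.apache.spark.Partitioner

// routes keys to partitions with a simple hash-modulo scheme
class ModuloPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    math.abs(key.hashCode) % numPartitions
}

// used with a key-value RDD
val pairs = sc.parallelize(1 to 100).map(n => (n, n * n))
pairs.partitionBy(new ModuloPartitioner(4)).partitions.size   // 4
----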

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TIP: It's useful to get familiar with https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html[Hadoop's TextInputFormat].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[coalesce]] coalesce Transformation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-partitions/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-partitions/#coalescenumpartitions-int-shuffle-boolean-falseimplicit-ord-orderingt-null-rddt","title":"coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The coalesce transformation is used to change the number of partitions. It can trigger shuffling depending on the shuffle flag (disabled by default, i.e. false).

In the following sample, you parallelize a local sequence of numbers and coalesce it first without and then with shuffling (note the shuffle parameter being false and true, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Use toDebugString to check out the RDD lineage graph.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> val rdd = sc.parallelize(0 to 10, 8)\nrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24\n\nscala> rdd.partitions.size\nres0: Int = 8\n\nscala> rdd.coalesce(numPartitions=8, shuffle=false)   // <1>\nres1: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[1] at coalesce at <console>:27\n\nscala> res1.toDebugString\nres2: String =\n(8) CoalescedRDD[1] at coalesce at <console>:27 []\n |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n\nscala> rdd.coalesce(numPartitions=8, shuffle=true)\nres3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at coalesce at <console>:27\n\nscala> res3.toDebugString\nres4: String =\n(8) MapPartitionsRDD[5] at coalesce at <console>:27 []\n |  CoalescedRDD[4] at coalesce at <console>:27 []\n |  ShuffledRDD[3] at coalesce at <console>:27 []\n +-(8) MapPartitionsRDD[2] at coalesce at <console>:27 []\n    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []\n
<1> shuffle is false by default and is used here explicitly for demo purposes. Note that the number of partitions stays the same as the number of partitions in the source RDD rdd.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/","title":"Transformations -- Lazy Operations on RDD (to Create One or More RDDs)","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Transformations are lazy operations on an rdd:RDD.md[RDD] that create one or many new RDDs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        // T and U are Scala types\ntransformation: RDD[T] => RDD[U]\ntransformation: RDD[T] => Seq[RDD[U]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output. Transformations do not change the input RDD (since rdd:index.md#introduction[RDDs are immutable] and hence cannot be modified), but produce one or more new RDDs by applying the computations they represent.
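A short illustration of that immutability:

[source, scala]
----
val xs = sc.parallelize(1 to 5)
val ys = xs.map(_ * 10)   // a new RDD; xs itself is left untouched

xs.collect                // Array(1, 2, 3, 4, 5)
ys.collect                // Array(10, 20, 30, 40, 50)
----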

[[methods]]
.(Subset of) RDD Transformations (Public API)
[cols=\"1m,3\",options=\"header\",width=\"100%\"]
|===
| Method | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | aggregate a| [[aggregate]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        aggregateU( seqOp: (U, T) => U, combOp: (U, U) => U): U

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | barrier a| [[barrier]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_1","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#barrier-rddbarriert","title":"barrier(): RDDBarrier[T]","text":"

(New in 2.4.0) Marks the current stage as a barrier stage in Barrier Execution Mode, in which Spark must launch all tasks together

Internally, barrier creates an RDDBarrier over the RDD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | cache a| [[cache]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_2","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#cache-thistype","title":"cache(): this.type","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Persists the RDD with the storage:StorageLevel.md#MEMORY_ONLY[MEMORY_ONLY] storage level

Synonym of <<persist, persist>>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | coalesce a| [[coalesce]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        coalesce( numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty) (implicit ord: Ordering[T] = null): RDD[T]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | filter a| [[filter]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_4","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#filterf-t-boolean-rddt","title":"filter(f: T => Boolean): RDD[T]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | flatMap a| [[flatMap]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_5","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#flatmapu-rddu","title":"flatMapU: RDD[U]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | map a| [[map]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_6","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#mapu-rddu","title":"mapU: RDD[U]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | mapPartitions a| [[mapPartitions]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_7","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        mapPartitionsU: RDD[U]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | mapPartitionsWithIndex a| [[mapPartitionsWithIndex]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_8","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        mapPartitionsWithIndexU: RDD[U]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | randomSplit a| [[randomSplit]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_9","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        randomSplit( weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | union a| [[union]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_10","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ++(other: RDD[T]): RDD[T] union(other: RDD[T]): RDD[T]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | persist a| [[persist]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#source-scala_11","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        persist(): this.type persist(newLevel: StorageLevel): this.type

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |===

By applying transformations you incrementally build an RDD lineage with all the parent RDDs of the final RDD(s).

Transformations are lazy, i.e. they are not executed immediately. Transformations are only executed once an action is called.
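A minimal sketch of the above (assuming a spark-shell session; the names and numbers are made up): the transformations below only record the lineage, and no job runs until the action at the end.

[source, scala]
----
// transformations only: nothing is executed yet
val numbers = sc.parallelize(1 to 100)
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// the lineage (parent RDDs) has already been recorded, though
println(evens.toDebugString)

// only this action triggers a Spark job
evens.count()
----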

After executing a transformation, the result RDD(s) will always be different from their parents and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).

CAUTION: There are transformations that may trigger jobs, e.g. sortBy, <<zipWithIndex, zipWithIndex>>, etc.
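For example, sortBy builds a range partitioner by sampling its input, which submits a job even before any action is called (a sketch assuming spark-shell; the data is illustrative):

[source, scala]
----
val data = sc.parallelize(Seq(3, 1, 4, 1, 5, 9, 2, 6), numSlices = 4)

// a sampling job shows up in the web UI (or the INFO logs) here,
// although no action has been called on the sorted RDD yet
val sorted = data.sortBy(identity)
----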

.From SparkContext by transformations to the result
image::rdd-sparkcontext-transformations-action.png[align=\"center\"]

Certain transformations can be pipelined, which is an optimization that Spark uses to improve the performance of computations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"rdd/spark-rdd-transformations/#sourcescala","title":"[source,scala]","text":"

scala> val file = sc.textFile(\"README.md\")
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:24

scala> val allWords = file.flatMap(_.split(\"\\W+\"))
allWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[55] at flatMap at <console>:26

scala> val words = allWords.filter(!_.isEmpty)
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at filter at <console>:28

scala> val pairs = words.map((_,1))
pairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[57] at map at <console>:30

scala> val reducedByKey = pairs.reduceByKey(_ + _)
reducedByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[59] at reduceByKey at <console>:32

scala> val top10words = reducedByKey.takeOrdered(10)(Ordering[Int].reverse.on(_._2))
INFO SparkContext: Starting job: takeOrdered at <console>:34
...
INFO DAGScheduler: Job 18 finished: takeOrdered at <console>:34, took 0.074386 s
top10words: Array[(String, Int)] = Array((the,21), (to,14), (Spark,13), (for,11), (and,10), (##,8), (a,8), (run,7), (can,6), (is,6))

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        There are two kinds of transformations:

• <<narrow-transformations, Narrow Transformations>>
• <<wide-transformations, Wide Transformations>>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[narrow-transformations]] Narrow Transformations

Narrow transformations are the result of operations like map and filter, where the data needed to compute the records in a single partition of the output RDD comes from a single partition of the parent RDD only, i.e. the computation is self-contained.

An output RDD has partitions with records that originate from a single partition in the parent RDD. Only a limited subset of partitions is used to calculate the result.

Spark groups narrow transformations into a single stage, which is known as pipelining.
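A sketch of a pipelined chain of narrow transformations (assuming spark-shell; the sample data is made up):

[source, scala]
----
val lines = sc.parallelize(Seq(\"spark core\", \"rdd lineage\", \"narrow ops\"))

// map and filter are narrow: every output partition depends on exactly
// one parent partition, so the whole chain is executed as a single stage
val upper = lines.map(_.toUpperCase)
val short = upper.filter(_.length < 12)

// the lineage shows no shuffle boundary
println(short.toDebugString)
short.collect()
----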

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[wide-transformations]] Wide Transformations

Wide transformations are the result of operations like groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD.

NOTE: Wide transformations are also called shuffle transformations, although they may or may not actually require a shuffle (e.g. the data may already be partitioned as required).

All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
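A sketch of a wide transformation introducing a shuffle boundary and hence a new stage (assuming spark-shell; the pairs are illustrative):

[source, scala]
----
val pairs = sc.parallelize(Seq((\"a\", 1), (\"b\", 1), (\"a\", 1)), numSlices = 4)

// reduceByKey is wide: all tuples with the same key must end up in the
// same partition, so Spark plans a shuffle (a ShuffledRDD in the lineage)
val counts = pairs.reduceByKey(_ + _)

// the lineage shows the shuffle boundary, i.e. a new stage
println(counts.toDebugString)
counts.collect()
----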

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rdd/spark-rdd-transformations/#zipwithindex","title":"zipWithIndex","text":""},{"location":"rdd/spark-rdd-transformations/#source-scala_12","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-transformations/#zipwithindex-rddt-long","title":"zipWithIndex(): RDD[(T, Long)]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          zipWithIndex zips this RDD[T] with its element indices.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"rdd/spark-rdd-transformations/#caution","title":"[CAUTION]","text":"

If the number of partitions of the source RDD is greater than 1, zipWithIndex submits an additional Spark job to compute the start index of every partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rdd/spark-rdd-transformations/#source-scala_13","title":"[source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          val onePartition = sc.parallelize(0 to 9, 1)

scala> onePartition.partitions.length
res0: Int = 1

// no job submitted
onePartition.zipWithIndex

val eightPartitions = sc.parallelize(0 to 9, 8)

scala> eightPartitions.partitions.length
res1: Int = 8

// submits a job
eightPartitions.zipWithIndex

.Spark job submitted by zipWithIndex transformation
image::spark-transformations-zipWithIndex-webui.png[align=\"center\"]
====

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"rest/","title":"Index","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          = Status REST API -- Monitoring Spark Applications Using REST API

Status REST API is a collection of REST endpoints under the /api/v1 URI path in the spark-api-UIRoot.md[root containers for application UI information]:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[SparkUI]] spark-webui-SparkUI.md[SparkUI] - Application UI for an active Spark application (i.e. a Spark application that is still running)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[HistoryServer]] spark-history-server:HistoryServer.md[HistoryServer] - Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)

Status REST API uses the spark-api-ApiRootResource.md[ApiRootResource] main resource class that registers the /api/v1 URI <<paths, paths>>.

[[paths]]
.URI Paths
[cols=\"1,2\",options=\"header\",width=\"100%\"]
|===
| Path | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[applications]] applications | [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[applications_appId]] applications/\\{appId} | [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

| [[version]] version | Creates a VersionInfo with the current version of Spark
|===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Status REST API uses the following components:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • https://jersey.github.io/[Jersey RESTful Web Services framework] with support for the https://github.com/jax-rs[Java API for RESTful Web Services] (JAX-RS API)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • https://www.eclipse.org/jetty/[Eclipse Jetty] as the lightweight HTTP server and the https://jcp.org/en/jsr/detail?id=369[Java Servlet] container
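As a quick way to exercise the endpoints above (a sketch assuming a running spark-shell with the web UI at the default http://localhost:4040), the version path can be queried with plain Scala:

[source, scala]
----
import scala.io.Source

// assumes the application UI is available at the default port 4040
val versionJson = Source.fromURL(\"http://localhost:4040/api/v1/version\").mkString
println(versionJson)
----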

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/AbstractApplicationResource/","title":"AbstractApplicationResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[AbstractApplicationResource]] AbstractApplicationResource

AbstractApplicationResource is a spark-api-BaseAppResource.md[BaseAppResource] with a set of <<paths, URI paths>> that are common across <<implementations, implementations>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          // start spark-shell\n$ http http://localhost:4040/api/v1/applications\nHTTP/1.1 200 OK\nContent-Encoding: gzip\nContent-Length: 257\nContent-Type: application/json\nDate: Tue, 05 Jun 2018 18:46:32 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[\n    {\n        \"attempts\": [\n            {\n                \"appSparkVersion\": \"2.3.1-SNAPSHOT\",\n                \"completed\": false,\n                \"duration\": 0,\n                \"endTime\": \"1969-12-31T23:59:59.999GMT\",\n                \"endTimeEpoch\": -1,\n                \"lastUpdated\": \"2018-06-05T15:04:48.328GMT\",\n                \"lastUpdatedEpoch\": 1528211088328,\n                \"sparkUser\": \"jacek\",\n                \"startTime\": \"2018-06-05T15:04:48.328GMT\",\n                \"startTimeEpoch\": 1528211088328\n            }\n        ],\n        \"id\": \"local-1528211089216\",\n        \"name\": \"Spark shell\"\n    }\n]\n\n$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd\nHTTP/1.1 200 OK\nContent-Length: 3\nContent-Type: application/json\nDate: Tue, 05 Jun 2018 18:48:00 GMT\nServer: Jetty(9.3.z-SNAPSHOT)\nVary: Accept-Encoding, User-Agent\n\n[]\n\n// Execute the following query in spark-shell\nspark.range(5).cache.count\n\n$ http http://localhost:4040/api/v1/applications/local-1528211089216/storage/rdd\n// output omitted for brevity\n

[[implementations]]
.AbstractApplicationResources
[cols=\"1,2\",options=\"header\",width=\"100%\"]
|===
| AbstractApplicationResource | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | spark-api-OneApplicationResource.md[OneApplicationResource] | [[OneApplicationResource]] Handles applications/appId requests

| spark-api-OneApplicationAttemptResource.md[OneApplicationAttemptResource] | [[OneApplicationAttemptResource]]
|===

[[paths]]
.AbstractApplicationResource's Paths
[cols=\"1,1,2\",options=\"header\",width=\"100%\"]
|===
| Path | HTTP Method | Description

| allexecutors | GET | <<allExecutorList, allExecutorList>>

| environment | GET | <<environmentInfo, environmentInfo>>

| executors | GET | <<executorList, executorList>>

| jobs | GET | <>

| jobs/{jobId: \\\\d+} | GET | <<oneJob, oneJob>>

| logs | GET | <>

| stages | GET | <>

| storage/rdd/{rddId: \\\\d+} | GET | <<rddData, rddData>>

| [[storage_rdd]] storage/rdd | GET | <<rddList, rddList>>
|===
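To exercise one of these paths (a sketch assuming a running spark-shell with the web UI at the default http://localhost:4040; the variable names are made up), the environment path can be fetched for the current application:

[source, scala]
----
import scala.io.Source

// sc.applicationId identifies the running application (e.g. local-1528211089216)
val appId = sc.applicationId
val url = s\"http://localhost:4040/api/v1/applications/$appId/environment\"

// returns the application environment (ApplicationEnvironmentInfo) as JSON
println(Source.fromURL(url).mkString)
----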

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[rddList]] rddList Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/AbstractApplicationResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#rddlist-seqrddstorageinfo","title":"rddList(): Seq[RDDStorageInfo]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          rddList...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: rddList is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[environmentInfo]] environmentInfo Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/AbstractApplicationResource/#source-scala_1","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#environmentinfo-applicationenvironmentinfo","title":"environmentInfo(): ApplicationEnvironmentInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          environmentInfo...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: environmentInfo is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[rddData]] rddData Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/AbstractApplicationResource/#source-scala_2","title":"[source, scala]","text":""},{"location":"rest/AbstractApplicationResource/#rdddatapathparamrddid-rddid-int-rddstorageinfo","title":"rddData(@PathParam(\"rddId\") rddId: Int): RDDStorageInfo","text":"

rddData...FIXME

NOTE: rddData is used when...FIXME

=== [[allExecutorList]] allExecutorList Method

[source, scala]
----
allExecutorList(): Seq[ExecutorSummary]
----

allExecutorList...FIXME

NOTE: allExecutorList is used when...FIXME

=== [[executorList]] executorList Method

[source, scala]
----
executorList(): Seq[ExecutorSummary]
----

executorList...FIXME

NOTE: executorList is used when...FIXME

=== [[oneJob]] oneJob Method

[source, scala]
----
oneJob(@PathParam("jobId") jobId: Int): JobData
----

oneJob...FIXME

NOTE: oneJob is used when...FIXME

=== [[jobsList]] jobsList Method

[source, scala]
----
jobsList(@QueryParam("status") statuses: JList[JobExecutionStatus]): Seq[JobData]
----

jobsList...FIXME

NOTE: jobsList is used when...FIXME
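Methods like oneJob and jobsList are plain JAX-RS resource methods: a path segment such as jobId is bound with @PathParam and a query-string parameter such as status with @QueryParam. The following minimal sketch (the JobsDemoResource and DemoJob names are made up for illustration and are not part of Spark) shows the same wiring in isolation:

[source, scala]
----
import java.util.{List => JList}
import javax.ws.rs.{GET, Path, PathParam, Produces, QueryParam}
import javax.ws.rs.core.MediaType

// Hypothetical payload class (a stand-in for Spark's JobData)
case class DemoJob(jobId: Int, status: String)

// Hypothetical resource class; AbstractApplicationResource binds
// oneJob and jobsList with the same kind of annotations.
@Produces(Array(MediaType.APPLICATION_JSON))
class JobsDemoResource {

  // GET .../jobs/{jobId} -- the numeric path segment is bound to jobId
  @GET
  @Path("jobs/{jobId: \\d+}")
  def oneJob(@PathParam("jobId") jobId: Int): DemoJob =
    DemoJob(jobId, "RUNNING")

  // GET .../jobs?status=... -- repeated query parameters arrive as a Java list
  @GET
  @Path("jobs")
  def jobsList(@QueryParam("status") statuses: JList[String]): Seq[DemoJob] =
    Seq(DemoJob(0, "SUCCEEDED"))
}
----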

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/ApiRequestContext/","title":"ApiRequestContext","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[ApiRequestContext]] ApiRequestContext

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApiRequestContext is the <> of...FIXME

[[contract]]
[source, scala]
----
package org.apache.spark.status.api.v1

trait ApiRequestContext {
  // only required methods that have no implementation
  // the others follow
  @Context var servletContext: ServletContext = _

  @Context var httpRequest: HttpServletRequest = _
}
----

NOTE: ApiRequestContext is a private[v1] contract.

.ApiRequestContext Contract
[cols="1,2",options="header",width="100%"]
|===
| Method
| Description

| httpRequest
| [[httpRequest]] Java Servlets' HttpServletRequest

Used when...FIXME

| servletContext
| [[servletContext]] Java Servlets' ServletContext

Used when...FIXME
|===

[[implementations]]
.ApiRequestContexts
[cols="1,2",options="header",width="100%"]
|===
| ApiRequestContext
| Description

| spark-api-ApiRootResource.md[ApiRootResource]
| [[ApiRootResource]]

| ApiStreamingApp
| [[ApiStreamingApp]]

| spark-api-ApplicationListResource.md[ApplicationListResource]
| [[ApplicationListResource]]

| spark-api-BaseAppResource.md[BaseAppResource]
| [[BaseAppResource]]

| SecurityFilter
| [[SecurityFilter]]
|===

=== [[uiRoot]] Getting Current UIRoot -- uiRoot Method

[source, scala]
----
uiRoot: UIRoot
----

uiRoot simply requests UIRootFromServletContext to spark-api-UIRootFromServletContext.md#getUiRoot[get the current UIRoot] (for the given <<servletContext, ServletContext>>).

NOTE: uiRoot is used when...FIXME
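Behind the scenes, the lookup is an attribute read on the servlet context: the handler setup stores the UIRoot on the context and uiRoot reads it back for the current request. A rough, hypothetical sketch of that handoff (the UIRootAttribute and DemoRequestContext names, and the attribute key, are illustrative, not Spark's exact code):

[source, scala]
----
import javax.servlet.ServletContext

// Stand-in for Spark's UIRoot abstraction
trait UIRoot

// Illustrative helper mirroring UIRootFromServletContext's role:
// stash the UIRoot in the ServletContext and read it back per request.
object UIRootAttribute {
  // hypothetical attribute key
  private val key = classOf[UIRoot].getCanonicalName

  def setUiRoot(context: ServletContext, uiRoot: UIRoot): Unit =
    context.setAttribute(key, uiRoot)

  def getUiRoot(context: ServletContext): UIRoot =
    context.getAttribute(key).asInstanceOf[UIRoot]
}

// An ApiRequestContext-like trait can then expose the lookup as uiRoot
trait DemoRequestContext {
  def servletContext: ServletContext
  def uiRoot: UIRoot = UIRootAttribute.getUiRoot(servletContext)
}
----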

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/ApiRootResource/","title":"ApiRootResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[ApiRootResource]] ApiRootResource -- /api/v1 URI Handler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApiRootResource is the spark-api-ApiRequestContext.md[ApiRequestContext] for the /v1 URI path.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApiRootResource uses @Path(\"/v1\") annotation at the class level. It is a partial URI path template relative to the base URI of the server on which the resource is deployed, the context root of the application, and the URL pattern to which the JAX-RS runtime responds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TIP: Learn more about @Path annotation in https://docs.oracle.com/cd/E19798-01/821-1841/6nmq2cp26/index.html[The @Path Annotation and URI Path Templates].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApiRootResource <> the /api/* context handler (with the REST resources and providers in org.apache.spark.status.api.v1 package).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With the @Path(\"/v1\") annotation and after <> the /api/* context handler, ApiRootResource serves HTTP requests for <> under the /api/v1 URI paths for spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApiRootResource gives the metrics of a Spark application in JSON format (using JAX-RS API).

----
// start spark-shell
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 257
Content-Type: application/json
Date: Tue, 05 Jun 2018 18:36:16 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
    {
        "attempts": [
            {
                "appSparkVersion": "2.3.1-SNAPSHOT",
                "completed": false,
                "duration": 0,
                "endTime": "1969-12-31T23:59:59.999GMT",
                "endTimeEpoch": -1,
                "lastUpdated": "2018-06-05T15:04:48.328GMT",
                "lastUpdatedEpoch": 1528211088328,
                "sparkUser": "jacek",
                "startTime": "2018-06-05T15:04:48.328GMT",
                "startTimeEpoch": 1528211088328
            }
        ],
        "id": "local-1528211089216",
        "name": "Spark shell"
    }
]

// Fixed in Spark 2.3.1
// https://issues.apache.org/jira/browse/SPARK-24188
$ http http://localhost:4040/api/v1/version
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 43
Content-Type: application/json
Date: Thu, 14 Jun 2018 08:19:06 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

{
    "spark": "2.3.1"
}
----

[[paths]]
.ApiRootResource's Paths
[cols="1,1,2",options="header",width="100%"]
|===
| Path
| HTTP Method
| Description

| [[applications]] applications
|
| [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

| [[applications_appId]] applications/\{appId}
|
| [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

| [[version]] version
| GET
| Creates a VersionInfo with the current version of Spark
|===

=== [[getServletHandler]] Creating /api/* Context Handler -- getServletHandler Method

[source, scala]
----
getServletHandler(uiRoot: UIRoot): ServletContextHandler
----

getServletHandler creates a Jetty ServletContextHandler for the /api context path.

NOTE: The Jetty ServletContextHandler created does not support HTTP sessions as the REST API is stateless.

getServletHandler creates a Jetty ServletHolder with the resources and providers in the org.apache.spark.status.api.v1 package. It then registers the ServletHolder to serve the /* context path (under the ServletContextHandler for /api).

getServletHandler requests UIRootFromServletContext to spark-api-UIRootFromServletContext.md#setUiRoot[setUiRoot] with the ServletContextHandler and the input spark-api-UIRoot.md[UIRoot].

NOTE: getServletHandler is used when spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer] are requested to initialize.
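The steps above are standard Jetty and Jersey plumbing: a session-less ServletContextHandler mounted at /api, and a ServletHolder hosting Jersey's ServletContainer configured to scan a package for resources and providers, mounted at /*. A simplified sketch under those assumptions (the demoApiHandler name is illustrative, and the Spark-specific setUiRoot step is left out):

[source, scala]
----
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}
import org.glassfish.jersey.servlet.ServletContainer

def demoApiHandler(): ServletContextHandler = {
  // no HTTP sessions -- the REST API is stateless
  val handler = new ServletContextHandler(ServletContextHandler.NO_SESSIONS)
  handler.setContextPath("/api")

  // Jersey servlet that scans a package for JAX-RS resources and providers
  val holder = new ServletHolder(classOf[ServletContainer])
  holder.setInitParameter(
    "jersey.config.server.provider.packages",
    "org.apache.spark.status.api.v1")

  // serve everything under /api/*
  handler.addServlet(holder, "/*")
  handler
}
----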

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/ApplicationListResource/","title":"ApplicationListResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[ApplicationListResource]] ApplicationListResource -- applications URI Handler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ApplicationListResource is a spark-api-ApiRequestContext.md[ApiRequestContext] that spark-api-ApiRootResource.md#applications[ApiRootResource] uses to handle <> URI path.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [[paths]] .ApplicationListResource's Paths [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Path | HTTP Method | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[root]] / | GET | <> |===

----
// start spark-shell
// there should be a single Spark application -- the spark-shell itself
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:40:33 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
    {
        "attempts": [
            {
                "appSparkVersion": "2.3.1-SNAPSHOT",
                "completed": false,
                "duration": 0,
                "endTime": "1969-12-31T23:59:59.999GMT",
                "endTimeEpoch": -1,
                "lastUpdated": "2018-06-06T12:30:19.220GMT",
                "lastUpdatedEpoch": 1528288219220,
                "sparkUser": "jacek",
                "startTime": "2018-06-06T12:30:19.220GMT",
                "startTimeEpoch": 1528288219220
            }
        ],
        "id": "local-1528288219790",
        "name": "Spark shell"
    }
]
----

=== [[isAttemptInRange]] isAttemptInRange Internal Method

[source, scala]
----
isAttemptInRange(
  attempt: ApplicationAttemptInfo,
  minStartDate: SimpleDateParam,
  maxStartDate: SimpleDateParam,
  minEndDate: SimpleDateParam,
  maxEndDate: SimpleDateParam,
  anyRunning: Boolean): Boolean
----

isAttemptInRange...FIXME

NOTE: isAttemptInRange is used exclusively when ApplicationListResource is requested to handle a <<root, GET>> HTTP request.
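Judging from its parameters, isAttemptInRange is a filter predicate that checks an attempt's start and end times against the min/max bounds taken from the query parameters, with anyRunning relaxing the end-time check for still-running attempts. A hypothetical predicate of the same shape (DemoAttempt and the exact comparison logic are assumptions, not Spark's implementation):

[source, scala]
----
// Hypothetical attempt record with epoch-millisecond timestamps
case class DemoAttempt(startTimeEpoch: Long, endTimeEpoch: Long, completed: Boolean)

// Keep an attempt if its start time falls within [min, max] and its end time
// either falls within its own bounds or the attempt is still running
// (when running attempts are requested).
def isAttemptInRange(
    attempt: DemoAttempt,
    minStartTime: Long,
    maxStartTime: Long,
    minEndTime: Long,
    maxEndTime: Long,
    anyRunning: Boolean): Boolean = {
  val startOk = attempt.startTimeEpoch >= minStartTime &&
    attempt.startTimeEpoch <= maxStartTime
  val endOk = (anyRunning && !attempt.completed) ||
    (attempt.endTimeEpoch >= minEndTime && attempt.endTimeEpoch <= maxEndTime)
  startOk && endOk
}
----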

=== [[appList]] appList Method

[source, scala]
----
appList(
  @QueryParam("status") status: JList[ApplicationStatus],
  @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: SimpleDateParam,
  @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: SimpleDateParam,
  @DefaultValue("2010-01-01") @QueryParam("minEndDate") minEndDate: SimpleDateParam,
  @DefaultValue("3000-01-01") @QueryParam("maxEndDate") maxEndDate: SimpleDateParam,
  @QueryParam("limit") limit: Integer): Iterator[ApplicationInfo]
----

appList...FIXME

NOTE: appList is used when...FIXME
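Since the query parameters map one-to-one onto the signature above (with the min/max dates defaulting to an effectively unbounded range), a filtered request is just a matter of adding them to the URL, for example (response omitted):

----
$ http "http://localhost:4040/api/v1/applications?minDate=2018-06-01&limit=10"
----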

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/BaseAppResource/","title":"BaseAppResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[BaseAppResource]] BaseAppResource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          BaseAppResource is the contract of spark-api-ApiRequestContext.md[ApiRequestContexts] that can <> and use <> and <> path parameters in URI paths.

[[path-params]]
.BaseAppResource's Path Parameters
[cols="1,2",options="header",width="100%"]
|===
| Name
| Description

| appId
| [[appId]] @PathParam("appId")

Used when...FIXME

| attemptId
| [[attemptId]] @PathParam("attemptId")

Used when...FIXME
|===
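Both parameters are ordinary JAX-RS path-parameter bindings injected into the resource as fields. A minimal sketch of that pattern (the DemoAppResource name is illustrative; BaseAppResource declares the fields along these lines):

[source, scala]
----
import javax.ws.rs.PathParam

// Illustrative trait: appId and attemptId are injected from the
// applications/{appId} and applications/{appId}/{attemptId} URI templates.
trait DemoAppResource {
  @PathParam("appId") protected var appId: String = null
  @PathParam("attemptId") protected var attemptId: String = null
}
----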

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [[implementations]] .BaseAppResources [cols=\"1,2\",options=\"header\",width=\"100%\"] |=== | BaseAppResource | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | spark-api-AbstractApplicationResource.md[AbstractApplicationResource] | [[AbstractApplicationResource]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | BaseStreamingAppResource | [[BaseStreamingAppResource]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | spark-api-StagesResource.md[StagesResource] | [[StagesResource]] |===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: BaseAppResource is a private[v1] contract.
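The appId and attemptId above are standard JAX-RS path parameters injected per request. The following stand-alone sketch (an illustration only, not Spark source; the class name and URI template are made up) shows the injection pattern that BaseAppResource relies on:

[source, scala]
----
import javax.ws.rs.{GET, Path, PathParam}

// Hypothetical JAX-RS resource: the container instantiates it per request
// and injects the {appId} segment of the request URI into the annotated field.
@Path("/applications/{appId}")
class ExampleAppResource {
  @PathParam("appId")
  protected var appId: String = _

  @GET
  def describe(): String = s"application: $appId"
}
----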

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[withUI]] withUI Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/BaseAppResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/BaseAppResource/#withuit-t","title":"withUIT: T","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          withUI...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: withUI is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/OneApplicationAttemptResource/","title":"OneApplicationAttemptResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[OneApplicationAttemptResource]] OneApplicationAttemptResource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          OneApplicationAttemptResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          OneApplicationAttemptResource is used when AbstractApplicationResource is requested to spark-api-AbstractApplicationResource.md#applicationAttempt[applicationAttempt].

[[paths]]
.OneApplicationAttemptResource's Paths
[cols="1,1,2",options="header",width="100%"]
|===
| Path | HTTP Method | Description

| [[root]] / | GET | <>
|===

----
// start spark-shell
// there should be a single Spark application -- the spark-shell itself
// CAUTION: FIXME Demo of OneApplicationAttemptResource in Action
----
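Until the demo above is fleshed out, here is a hedged sketch of querying this endpoint from Scala. It assumes a History Server on the default http://localhost:18080 and an application whose attempts are numbered; the appId and attemptId values are placeholders, not real identifiers.

[source, scala]
----
import scala.io.Source

// Placeholders: substitute a real application ID and attempt ID
// (e.g. taken from GET /api/v1/applications).
val appId = "application_1234567890123_0001"
val attemptId = "1"

// GET applications/{appId}/{attemptId} is served by OneApplicationAttemptResource
val url = s"http://localhost:18080/api/v1/applications/$appId/$attemptId"
println(Source.fromURL(url).mkString)
----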

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[getAttempt]] getAttempt Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/OneApplicationAttemptResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/OneApplicationAttemptResource/#getattempt-applicationattemptinfo","title":"getAttempt(): ApplicationAttemptInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getAttempt requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]) and finds the spark-api-BaseAppResource.md#attemptId[attemptId] among the available attempts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: spark-api-BaseAppResource.md#appId[appId] and spark-api-BaseAppResource.md#attemptId[attemptId] are path parameters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, getAttempt returns the ApplicationAttemptInfo if available or reports a NotFoundException:

----
unknown app [appId], attempt [attemptId]
----
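A minimal sketch of that lookup, assuming the uiRoot, appId and attemptId members and the NotFoundException described above (an illustration of the described behaviour, not necessarily the exact Spark source):

[source, scala]
----
def getAttempt(): ApplicationAttemptInfo = {
  uiRoot.getApplicationInfo(appId)                                 // look the application up by appId
    .flatMap(_.attempts.find(_.attemptId.contains(attemptId)))     // pick the matching attempt, if any
    .getOrElse(throw new NotFoundException(
      s"unknown app $appId, attempt $attemptId"))
}
----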
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/OneApplicationResource/","title":"OneApplicationResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[OneApplicationResource]] OneApplicationResource -- applications/appId URI Handler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          OneApplicationResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly) that spark-api-ApiRootResource.md#applications_appId[ApiRootResource] uses to handle <> URI path.

[[paths]]
.OneApplicationResource's Paths
[cols="1,1,2",options="header",width="100%"]
|===
| Path | HTTP Method | Description

| [[root]] / | GET | <>
|===

----
// start spark-shell
// there should be a single Spark application -- the spark-shell itself
$ http http://localhost:4040/api/v1/applications
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:40:33 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
    {
        "attempts": [
            {
                "appSparkVersion": "2.3.1-SNAPSHOT",
                "completed": false,
                "duration": 0,
                "endTime": "1969-12-31T23:59:59.999GMT",
                "endTimeEpoch": -1,
                "lastUpdated": "2018-06-06T12:30:19.220GMT",
                "lastUpdatedEpoch": 1528288219220,
                "sparkUser": "jacek",
                "startTime": "2018-06-06T12:30:19.220GMT",
                "startTimeEpoch": 1528288219220
            }
        ],
        "id": "local-1528288219790",
        "name": "Spark shell"
    }
]

$ http http://localhost:4040/api/v1/applications/local-1528288219790
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 255
Content-Type: application/json
Date: Wed, 06 Jun 2018 12:41:43 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

{
    "attempts": [
        {
            "appSparkVersion": "2.3.1-SNAPSHOT",
            "completed": false,
            "duration": 0,
            "endTime": "1969-12-31T23:59:59.999GMT",
            "endTimeEpoch": -1,
            "lastUpdated": "2018-06-06T12:30:19.220GMT",
            "lastUpdatedEpoch": 1528288219220,
            "sparkUser": "jacek",
            "startTime": "2018-06-06T12:30:19.220GMT",
            "startTimeEpoch": 1528288219220
        }
    ],
    "id": "local-1528288219790",
    "name": "Spark shell"
}
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[getApp]] getApp Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/OneApplicationResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/OneApplicationResource/#getapp-applicationinfo","title":"getApp(): ApplicationInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getApp requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, getApp returns the ApplicationInfo if available or reports a NotFoundException:

----
unknown app: [appId]
----
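A minimal sketch of that lookup, assuming the uiRoot and appId members and the NotFoundException described above (an illustration, not necessarily the exact Spark source):

[source, scala]
----
def getApp(): ApplicationInfo = {
  // ask the UIRoot for the application registered under appId
  val app = uiRoot.getApplicationInfo(appId)
  app.getOrElse(throw new NotFoundException("unknown app: " + appId))
}
----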
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/","title":"StagesResource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[StagesResource]] StagesResource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          StagesResource is...FIXME

[[paths]]
.StagesResource's Paths
[cols="1,1,2",options="header",width="100%"]
|===
| Path | HTTP Method | Description

| | GET | <>
| {stageId: \d+} | GET | <>
| {stageId: \d+}/{stageAttemptId: \d+} | GET | <>
| {stageId: \d+}/{stageAttemptId: \d+}/taskSummary | GET | <>
| {stageId: \d+}/{stageAttemptId: \d+}/taskList | GET | <>
|===
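StagesResource is mounted under an application, so the paths above hang off /api/v1/applications/[appId]/stages. A hedged Scala sketch of exercising these endpoints against a local spark-shell follows; the appId, stage ID and attempt ID values are placeholders.

[source, scala]
----
import scala.io.Source

val appId = "local-1528288219790"  // placeholder; list /api/v1/applications to find yours

def get(path: String): String =
  Source.fromURL(s"http://localhost:4040/api/v1/applications/$appId/stages$path").mkString

println(get(""))                        // all stages (optionally filtered, e.g. ?status=COMPLETE)
println(get("/0"))                      // a single stage, all attempts
println(get("/0/0"))                    // one stage attempt
println(get("/0/0/taskSummary"))        // task metric distributions (default quantiles)
println(get("/0/0/taskList?length=5"))  // the first five tasks of the attempt
----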

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[stageList]] stageList Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/#source-scala","title":"[source, scala]","text":""},{"location":"rest/StagesResource/#stagelistqueryparamstatus-statuses-jliststagestatus-seqstagedata","title":"stageList(@QueryParam(\"status\") statuses: JList[StageStatus]): Seq[StageData]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stageList...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: stageList is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[stageData]] stageData Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stageData( @PathParam(\"stageId\") stageId: Int, @QueryParam(\"details\") @DefaultValue(\"true\") details: Boolean): Seq[StageData]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stageData...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: stageData is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[oneAttemptData]] oneAttemptData Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          oneAttemptData( @PathParam(\"stageId\") stageId: Int, @PathParam(\"stageAttemptId\") stageAttemptId: Int, @QueryParam(\"details\") @DefaultValue(\"true\") details: Boolean): StageData

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          oneAttemptData...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: oneAttemptData is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[taskSummary]] taskSummary Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          taskSummary( @PathParam(\"stageId\") stageId: Int, @PathParam(\"stageAttemptId\") stageAttemptId: Int, @DefaultValue(\"0.05,0.25,0.5,0.75,0.95\") @QueryParam(\"quantiles\") quantileString: String) : TaskMetricDistributions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          taskSummary...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: taskSummary is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[taskList]] taskList Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/StagesResource/#source-scala_4","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          taskList( @PathParam(\"stageId\") stageId: Int, @PathParam(\"stageAttemptId\") stageAttemptId: Int, @DefaultValue(\"0\") @QueryParam(\"offset\") offset: Int, @DefaultValue(\"20\") @QueryParam(\"length\") length: Int, @DefaultValue(\"ID\") @QueryParam(\"sortBy\") sortBy: TaskSorting): Seq[TaskData]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          taskList...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: taskList is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/UIRoot/","title":"UIRoot","text":"

== [[UIRoot]] UIRoot -- Contract for Root Containers of Application UI Information

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          UIRoot is the <> of the <>.

[[contract]]
[source, scala]
----
package org.apache.spark.status.api.v1

trait UIRoot {
  // only required methods that have no implementation
  // the others follow
  def withSparkUI[T](fn: SparkUI => T): T
  def getApplicationInfoList: Iterator[ApplicationInfo]
  def getApplicationInfo(appId: String): Option[ApplicationInfo]
  def securityManager: SecurityManager
}
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: UIRoot is a private[spark] contract.

.UIRoot Contract
[cols="1,2",options="header",width="100%"]
|===
| Method | Description

| getApplicationInfo | [[getApplicationInfo]] Used when...FIXME
| getApplicationInfoList | [[getApplicationInfoList]] Used when...FIXME
| securityManager | [[securityManager]] Used when...FIXME
| withSparkUI | [[withSparkUI]] Used exclusively when BaseAppResource is requested spark-api-BaseAppResource.md#withUI[withUI]
|===

[[implementations]]
.UIRoots
[cols="1,2",options="header",width="100%"]
|===
| UIRoot | Description

| spark-history-server:HistoryServer.md[HistoryServer]
| [[HistoryServer]] Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)

| spark-webui-SparkUI.md[SparkUI]
| [[SparkUI]] Application UI for an active Spark application (i.e. a Spark application that is still running)

|===
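To illustrate the contract, the following hedged sketch uses only getApplicationInfoList; the helper itself is made up, and the ApplicationInfo fields it touches (id, attempts, completed) match the REST payload shown for OneApplicationResource. Since UIRoot is a private[spark] contract, such code would have to live under the org.apache.spark package namespace.

[source, scala]
----
import org.apache.spark.status.api.v1.UIRoot

// Made-up helper: IDs of the applications that have at least one completed attempt.
def completedAppIds(uiRoot: UIRoot): Seq[String] =
  uiRoot.getApplicationInfoList
    .filter(_.attempts.exists(_.completed))
    .map(_.id)
    .toSeq
----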

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[writeEventLogs]] writeEventLogs Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/UIRoot/#source-scala","title":"[source, scala]","text":""},{"location":"rest/UIRoot/#writeeventlogsappid-string-attemptid-optionstring-zipstream-zipoutputstream-unit","title":"writeEventLogs(appId: String, attemptId: Option[String], zipStream: ZipOutputStream): Unit","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          writeEventLogs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: writeEventLogs is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/UIRootFromServletContext/","title":"UIRootFromServletContext","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[UIRootFromServletContext]] UIRootFromServletContext

UIRootFromServletContext manages the current spark-api-UIRoot.md[UIRoot] object in a Jetty ContextHandler.

[[attribute]] UIRootFromServletContext uses its canonical name for the context attribute that is used to <<setUiRoot, set>> or <<getUiRoot, get>> the current spark-api-UIRoot.md[UIRoot] object (in Jetty's ContextHandler).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/server/handler/ContextHandler.html[ContextHandler] is the environment for multiple Jetty Handlers, e.g. URI context path, class loader, static resource base.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In essence, UIRootFromServletContext is simply a \"bridge\" between two worlds, Spark's spark-api-UIRoot.md[UIRoot] and Jetty's ContextHandler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[setUiRoot]] setUiRoot Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/UIRootFromServletContext/#source-scala","title":"[source, scala]","text":""},{"location":"rest/UIRootFromServletContext/#setuirootcontexthandler-contexthandler-uiroot-uiroot-unit","title":"setUiRoot(contextHandler: ContextHandler, uiRoot: UIRoot): Unit","text":"

setUiRoot registers the given spark-api-UIRoot.md[UIRoot] as the <<attribute, context attribute>> of the given Jetty ContextHandler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: setUiRoot is used exclusively when ApiRootResource is requested to spark-api-ApiRootResource.md#getServletHandler[register /api/* context handler].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[getUiRoot]] getUiRoot Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rest/UIRootFromServletContext/#source-scala_1","title":"[source, scala]","text":""},{"location":"rest/UIRootFromServletContext/#getuirootcontext-servletcontext-uiroot","title":"getUiRoot(context: ServletContext): UIRoot","text":"

getUiRoot looks up the current spark-api-UIRoot.md[UIRoot] using the <<attribute, context attribute>> of the given ServletContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: getUiRoot is used exclusively when ApiRequestContext is requested for the current spark-api-ApiRequestContext.md#uiRoot[UIRoot].
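
The two methods can be sketched as follows. This is a sketch only (not the Spark source), based on the description of the <<attribute, context attribute>> above; it assumes Jetty's ContextHandler and the standard ServletContext API.

[source, scala]
----
import javax.servlet.ServletContext
import org.eclipse.jetty.server.handler.ContextHandler

// Sketch: the current UIRoot is stored as a servlet context attribute
// keyed by the canonical class name (the "context attribute" above).
object UIRootFromServletContext {

  private val attribute = getClass.getCanonicalName

  // setUiRoot: register the given UIRoot in the Jetty ContextHandler
  def setUiRoot(contextHandler: ContextHandler, uiRoot: UIRoot): Unit =
    contextHandler.setAttribute(attribute, uiRoot)

  // getUiRoot: look the UIRoot up again from the ServletContext
  def getUiRoot(context: ServletContext): UIRoot =
    context.getAttribute(attribute).asInstanceOf[UIRoot]
}
----

Storing the UIRoot as a context attribute is what lets any servlet or JAX-RS resource running in that Jetty context find its way back to the Spark application data.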

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rpc/","title":"RPC System","text":"

RPC System is the communication layer that Spark services use to exchange messages with one another.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The main abstractions are RpcEnv and RpcEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rpc/NettyRpcEnv/","title":"NettyRpcEnv","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NettyRpcEnv is an RpcEnv that uses Netty (\"an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients\").

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"rpc/NettyRpcEnv/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NettyRpcEnv takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JavaSerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Host Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Number of CPU Cores

NettyRpcEnv is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyRpcEnvFactory is requested to create an RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rpc/NettyRpcEnvFactory/","title":"NettyRpcEnvFactory","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NettyRpcEnvFactory is an RpcEnvFactory for a Netty-based RpcEnv.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rpc/NettyRpcEnvFactory/#creating-rpcenv","title":"Creating RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            create(\n  config: RpcEnvConfig): RpcEnv\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            create creates a JavaSerializerInstance (using a JavaSerializer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            KryoSerializer is not supported.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            create creates a rpc:NettyRpcEnv.md[] with the JavaSerializerInstance. create uses the given rpc:RpcEnvConfig.md[] for the rpc:RpcEnvConfig.md#advertiseAddress[advertised address], rpc:RpcEnvConfig.md#securityManager[SecurityManager] and rpc:RpcEnvConfig.md#numUsableCores[number of CPU cores].

create returns the NettyRpcEnv. With rpc:RpcEnvConfig.md#clientMode[clientMode] enabled, create returns it right away (with no server started).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In server mode, create attempts to start the NettyRpcEnv on a given port. create uses the given rpc:RpcEnvConfig.md[] for the rpc:RpcEnvConfig.md#port[port], rpc:RpcEnvConfig.md#bindAddress[bind address], and rpc:RpcEnvConfig.md#name[name]. With the port, the NettyRpcEnv is requested to rpc:NettyRpcEnv.md#startServer[start a server].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            create is part of the rpc:RpcEnvFactory.md#create[RpcEnvFactory] abstraction.
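
The flow can be sketched as follows. This is a simplified sketch, not the actual Spark source; the port-retry and error-handling logic of the real implementation is omitted, and it assumes Spark's internal NettyRpcEnv and RpcEnvConfig types.

```scala
import org.apache.spark.serializer.{JavaSerializer, JavaSerializerInstance}

// Simplified sketch of the create flow described above
def create(config: RpcEnvConfig): RpcEnv = {
  val sparkConf = config.conf

  // Only JavaSerializer is supported (KryoSerializer is not)
  val javaSerializerInstance =
    new JavaSerializer(sparkConf).newInstance().asInstanceOf[JavaSerializerInstance]

  // Create the NettyRpcEnv with the advertised address, SecurityManager
  // and the number of CPU cores from the RpcEnvConfig
  val nettyEnv = new NettyRpcEnv(
    sparkConf,
    javaSerializerInstance,
    config.advertiseAddress,
    config.securityManager,
    config.numUsableCores)

  // In server mode, start the server on the configured bind address and port
  // (the real implementation retries on successive ports and shuts the
  // NettyRpcEnv down if the server cannot be started)
  if (!config.clientMode) {
    nettyEnv.startServer(config.bindAddress, config.port)
  }

  nettyEnv
}
```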

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"rpc/RpcAddress/","title":"RpcAddress","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RpcAddress is a logical address of an RPC system, with hostname and port.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RpcAddress can be encoded as a Spark URL in the format of spark://host:port.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"rpc/RpcAddress/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            RpcAddress takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Host
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Port"},{"location":"rpc/RpcAddress/#creating-rpcaddress-based-on-spark-url","title":"Creating RpcAddress based on Spark URL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              fromSparkURL(\n  sparkUrl: String): RpcAddress\n

fromSparkURL extracts a host and a port from the given Spark URL and creates an RpcAddress (a simplified sketch follows the list below).

fromSparkURL is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • StandaloneAppClient (Spark Standalone) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Worker (Spark Standalone) is requested to startRpcEnvAndEndpoint
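
A simplified, self-contained sketch of the parsing (the actual implementation delegates to an internal Spark helper):

```scala
import java.net.URI

case class RpcAddress(host: String, port: Int) {
  // Encode this address as a Spark URL, e.g. spark://host:port
  def toSparkURL: String = "spark://" + host + ":" + port
}

object RpcAddress {
  // Sketch of fromSparkURL: parse spark://host:port back into an RpcAddress
  def fromSparkURL(sparkUrl: String): RpcAddress = {
    val uri = new URI(sparkUrl)
    require(uri.getScheme == "spark" && uri.getHost != null && uri.getPort != -1,
      "Invalid Spark URL: " + sparkUrl)
    RpcAddress(uri.getHost, uri.getPort)
  }
}
```

For example, `RpcAddress.fromSparkURL("spark://localhost:7077")` yields `RpcAddress("localhost", 7077)`, whose `toSparkURL` gives back the original URL.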
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/","title":"RpcEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RpcEndpoint is an abstraction of RPC endpoints that are registered to an RpcEnv to process one- (fire-and-forget) or two-way messages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"rpc/RpcEndpoint/#contract","title":"Contract","text":""},{"location":"rpc/RpcEndpoint/#onconnected","title":"onConnected
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onConnected(\n  remoteAddress: RpcAddress): Unit\n

Invoked when the remote node at the given RpcAddress has connected to the current node

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process a RemoteProcessConnected message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#ondisconnected","title":"onDisconnected
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onDisconnected(\n  remoteAddress: RpcAddress): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process a RemoteProcessDisconnected message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#onerror","title":"onError
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onError(\n  cause: Throwable): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process a message that threw a NonFatal exception
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#onnetworkerror","title":"onNetworkError
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onNetworkError(\n  cause: Throwable,\n  remoteAddress: RpcAddress): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process a RemoteProcessConnectionError message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#onstart","title":"onStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onStart(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process an OnStart message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#onstop","title":"onStop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onStop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process an OnStop message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#processing-one-way-messages","title":"Processing One-Way Messages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              receive: PartialFunction[Any, Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process an OneWayMessage message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#processing-two-way-messages","title":"Processing Two-Way Messages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              receiveAndReply(\n  context: RpcCallContext): PartialFunction[Any, Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Inbox is requested to process a RpcMessage message
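As a concrete illustration of the two callbacks above, here is a minimal sketch of a custom endpoint (a hypothetical EchoEndpoint; Spark's RPC classes are private[spark], so this is a sketch rather than user-facing API):

```scala
import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

// Hypothetical endpoint illustrating both callbacks. The RPC classes are
// private[spark], so such code would have to live under org.apache.spark.
class EchoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {

  // One-way messages (sent with RpcEndpointRef.send) land here; no reply is expected.
  override def receive: PartialFunction[Any, Unit] = {
    case message: String => println(s"one-way: $message")
  }

  // Two-way messages (sent with RpcEndpointRef.ask) land here; reply via the RpcCallContext.
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case message: String => context.reply(s"echo: $message")
  }
}
```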
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#rpcenv","title":"RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              rpcEnv: RpcEnv\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              RpcEnv this RpcEndpoint is registered to

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"rpc/RpcEndpoint/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • AMEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • IsolatedRpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MapOutputTrackerMasterEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OutputCommitCoordinatorEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RpcEndpointVerifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ThreadSafeRpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • WorkerWatcher
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • "},{"location":"rpc/RpcEndpoint/#self","title":"self
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                self: RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                self requests the RpcEnv for the RpcEndpointRef of this RpcEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                self throws an IllegalArgumentException when the RpcEnv has not been initialized:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                rpcEnv has not been initialized\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEndpoint/#stopping-rpcendpoint","title":"Stopping RpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                stop requests the RpcEnv to stop this RpcEndpoint
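To make self and stop concrete, here is a hedged sketch of an endpoint that messages itself once registered and stops itself on a hypothetical control message:

```scala
import org.apache.spark.rpc.{RpcEndpoint, RpcEnv}

// Hypothetical control message.
case object Shutdown

class SelfStoppingEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {

  // onStart runs after the endpoint is registered, so self is already initialized here.
  override def onStart(): Unit = self.send("hello from myself")

  override def receive: PartialFunction[Any, Unit] = {
    case Shutdown        => stop()  // requests the RpcEnv to stop this endpoint
    case message: String => println(s"received: $message")
  }
}
```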

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEndpointAddress/","title":"RpcEndpointAddress","text":"


RpcEndpointAddress is a logical address of an endpoint in an RPC system, made up of an RpcAddress and a name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RpcEndpointAddress is in the format of spark://[name]@[rpcAddress.host]:[rpcAddress.port].
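A quick, purely illustrative sketch of that layout (made-up values):

```scala
// Made-up values, only to show the spark://[name]@[rpcAddress.host]:[rpcAddress.port] layout.
val name = "CoarseGrainedScheduler"
val host = "192.168.1.6"
val port = 46787
val endpointUrl = s"spark://$name@$host:$port"
// endpointUrl: spark://CoarseGrainedScheduler@192.168.1.6:46787
```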

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rpc/RpcEndpointRef/","title":"RpcEndpointRef","text":"

RpcEndpointRef is a reference to an RpcEndpoint in an RpcEnv.

RpcEndpointRef is serializable, so you can send it over the network or save it for later use (it can, however, only be deserialized by the owning RpcEnv).

An RpcEndpointRef has an address (a Spark URL) and a name.

You can send asynchronous one-way messages to the corresponding RpcEndpoint using the send method.

You can send a semi-synchronous message, i.e. \"subscribe\" to be notified when a response arrives, using the ask method. You can also block the current calling thread for a response using the askWithRetry method.

• spark.rpc.numRetries (default: 3) - the number of times to retry a connection attempt.
• spark.rpc.retry.wait (default: 3s) - the time to wait between retries.

It also uses endpoint lookup timeouts.
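The difference between the fire-and-forget send and the Future-based ask can be sketched as follows (hypothetical messages; assumes an RpcEndpointRef already obtained from an RpcEnv):

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import org.apache.spark.rpc.RpcEndpointRef

def demo(endpointRef: RpcEndpointRef): Unit = {
  // One-way, fire-and-forget: processed by the endpoint's receive.
  endpointRef.send("ping")

  // Two-way: processed by receiveAndReply; the endpoint's reply completes the Future.
  val answer: Future[String] = endpointRef.ask[String]("ping")
  answer.foreach(a => println(s"got: $a"))
}
```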

send Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME

askWithRetry Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rpc/RpcEnv/","title":"RpcEnv","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RpcEnv is an abstraction of RPC environments.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rpc/RpcEnv/#contract","title":"Contract","text":""},{"location":"rpc/RpcEnv/#address","title":"address
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                address: RpcAddress\n

RpcAddress of this RPC environment

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#asyncsetupendpointrefbyuri","title":"asyncSetupEndpointRefByURI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                asyncSetupEndpointRefByURI(\n  uri: String): Future[RpcEndpointRef]\n

Looks up the RpcEndpointRef of an RPC endpoint by URI (asynchronously)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • WorkerWatcher is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedExecutorBackend is requested to onStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RpcEnv is requested to setupEndpointRefByURI
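A hedged usage sketch (the URI is made up and follows the RpcEndpointAddress format):

```scala
import scala.concurrent.Future

import org.apache.spark.rpc.{RpcEndpointRef, RpcEnv}

// Resolves an endpoint reference by its spark://[name]@[host]:[port] URI, asynchronously.
def lookUp(rpcEnv: RpcEnv): Future[RpcEndpointRef] =
  rpcEnv.asyncSetupEndpointRefByURI("spark://Worker@192.168.1.6:7078")
```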
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#awaittermination","title":"awaitTermination
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                awaitTermination(): Unit\n

Blocks the current thread until the RPC environment terminates

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkEnv is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • LocalSparkCluster (Spark Standalone) is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Master (Spark Standalone) and Worker (Spark Standalone) are launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedExecutorBackend is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#deserialize","title":"deserialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                deserialize[T](\n  deserializationAction: () => T): T\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • PersistenceEngine is requested to readPersistedData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • NettyRpcEnv is requested to deserialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#endpointref","title":"endpointRef
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                endpointRef(\n  endpoint: RpcEndpoint): RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RpcEndpoint is requested for the RpcEndpointRef to itself
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#rpcenvfileserver","title":"RpcEnvFileServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fileServer: RpcEnvFileServer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RpcEnvFileServer of this RPC environment

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is requested to addFile, addJar and is created (and registers the REPL's output directory)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#openchannel","title":"openChannel
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                openChannel(\n  uri: String): ReadableByteChannel\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Opens a channel to download a file at the given URI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Utils utility is used to doFetchFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorClassLoader is requested to getClassFileInputStreamFromSparkRPC
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#setupendpoint","title":"setupEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                setupEndpoint(\n  name: String,\n  endpoint: RpcEndpoint): RpcEndpointRef\n
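setupEndpoint registers an RpcEndpoint under a name and returns its RpcEndpointRef. A minimal sketch (the endpoint name is hypothetical):

```scala
import org.apache.spark.rpc.{RpcEndpoint, RpcEndpointRef, RpcEnv}

// Registers the given endpoint under a (hypothetical) name and returns its reference.
def register(rpcEnv: RpcEnv, endpoint: RpcEndpoint): RpcEndpointRef =
  rpcEnv.setupEndpoint("Echo", endpoint)
```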
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#shutdown","title":"shutdown
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                shutdown(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Shuts down this RPC environment asynchronously (and to make sure this RpcEnv exits successfully, use awaitTermination)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkEnv is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • LocalSparkCluster (Spark Standalone) is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DriverWrapper is launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedExecutorBackend is launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • NettyRpcEnvFactory is requested to create an RpcEnv (in server mode and failed to assign a port)
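A minimal sketch of the shutdown/awaitTermination pattern, assuming code with access to Spark's internal (private[spark]) RPC API, e.g. compiled into an org.apache.spark package; the RpcEnv name and host below are made up for illustration:

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnv

val conf = new SparkConf()
// Create a server-mode RpcEnv (port 0 lets the OS pick a free port)
val rpcEnv = RpcEnv.create("demo", "localhost", 0, conf, new SecurityManager(conf))

// shutdown is asynchronous...
rpcEnv.shutdown()
// ...so block until the RpcEnv has actually exited
rpcEnv.awaitTermination()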
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#stopping-rpcendpointref","title":"Stopping RpcEndpointRef
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                stop(\n  endpoint: RpcEndpointRef): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RpcEndpoint is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • in Spark SQL
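A hypothetical sketch of registering an endpoint and then stopping it through its RpcEndpointRef. The endpoint class and names are made up, rpcEnv is an already-created RpcEnv (as in the earlier sketch), and internal (private[spark]) access is assumed:

import org.apache.spark.rpc.{RpcEndpoint, RpcEndpointRef, RpcEnv}

class DemoEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  override def receive: PartialFunction[Any, Unit] = {
    case msg => println(s"DemoEndpoint received: $msg")
  }
}

val endpointRef: RpcEndpointRef = rpcEnv.setupEndpoint("demo-endpoint", new DemoEndpoint(rpcEnv))
endpointRef.send("hello")

// Unregister the endpoint; its onStop callback is invoked and further messages are dropped
rpcEnv.stop(endpointRef)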
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"rpc/RpcEnv/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • NettyRpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"rpc/RpcEnv/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RpcEnv takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  RpcEnv is created using RpcEnv.create utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Abstract Class

RpcEnv is an abstract class and cannot be created directly. It is created indirectly through one of the concrete RpcEnvs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"rpc/RpcEnv/#creating-rpcenv","title":"Creating RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  create(\n  name: String,\n  host: String,\n  port: Int,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  clientMode: Boolean = false): RpcEnv // (1)\ncreate(\n  name: String,\n  bindAddress: String,\n  advertiseAddress: String,\n  port: Int,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  numUsableCores: Int,\n  clientMode: Boolean): RpcEnv\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. Uses 0 for numUsableCores

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  create creates a NettyRpcEnvFactory and requests it to create an RpcEnv (with a new RpcEnvConfig with all the given arguments).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  create is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkEnv utility is requested to create a SparkEnv (clientMode flag is turned on for executors and off for the driver)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • With clientMode flag true:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CoarseGrainedExecutorBackend is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ClientApp (Spark Standalone) is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Master (Spark Standalone) is requested to startRpcEnvAndEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Worker (Spark Standalone) is requested to startRpcEnvAndEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DriverWrapper is launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ApplicationMaster (Spark on YARN) is requested to runExecutorLauncher (in client deploy mode)
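As a hedged illustration of the two modes (not taken from the Spark sources; names are made up and internal API access is assumed), the first call below creates a server-mode RpcEnv that listens on a port, while the second passes clientMode = true so the environment only opens outbound connections:

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnv

val conf = new SparkConf()
val securityMgr = new SecurityManager(conf)

// Driver-like: server mode (clientMode defaults to false), listens for incoming connections
val serverEnv = RpcEnv.create("driver-demo", "localhost", 0, conf, securityMgr)

// Executor-like: client mode, does not listen for incoming connections
val clientEnv = RpcEnv.create("executor-demo", "localhost", 0, conf, securityMgr, clientMode = true)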
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEnv/#default-endpoint-lookup-timeout","title":"Default Endpoint Lookup Timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  RpcEnv uses the default lookup timeout for...FIXME

When a remote endpoint is resolved, the local RPC environment connects to the remote one (endpoint lookup). To configure the time allowed for the endpoint lookup, use the following settings.

It is a prioritized list of lookup timeout properties (the higher on the list, the higher the precedence):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.rpc.lookupTimeout
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.network.timeout
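A minimal sketch of setting the two properties on a SparkConf (the values are arbitrary); spark.rpc.lookupTimeout, when set, takes precedence over the more general spark.network.timeout:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.rpc.lookupTimeout", "30s")   // consulted first
  .set("spark.network.timeout", "120s")    // fallback when spark.rpc.lookupTimeout is not set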
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"rpc/RpcEnvConfig/","title":"RpcEnvConfig","text":"

= RpcEnvConfig

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[creating-instance]] RpcEnvConfig is a configuration of an rpc:RpcEnv.md[]:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[conf]] SparkConf.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[name]] System Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[bindAddress]] Bind Address
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[advertiseAddress]] Advertised Address
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[port]] Port
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[securityManager]] SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[numUsableCores]] Number of CPU cores
• <<clientMode, clientMode Flag>>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    RpcEnvConfig is created when RpcEnv utility is used to rpc:RpcEnv.md#create[create an RpcEnv] (using rpc:RpcEnvFactory.md[]).
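A sketch of how the arguments above map onto an RpcEnvConfig that is then handed to a factory (NettyRpcEnvFactory here). This mirrors the flow described above rather than the verbatim Spark sources, and assumes the code can access Spark's internal RPC classes (they are private to Spark's packages); all values are made up:

import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.RpcEnvConfig
import org.apache.spark.rpc.netty.NettyRpcEnvFactory

val conf = new SparkConf()
val config = RpcEnvConfig(
  conf,
  name = "demo",
  bindAddress = "localhost",
  advertiseAddress = "localhost",
  port = 0,
  securityManager = new SecurityManager(conf),
  numUsableCores = 0,
  clientMode = false)

// The factory turns the configuration into a concrete (Netty-based) RpcEnv
val rpcEnv = new NettyRpcEnvFactory().create(config)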

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[clientMode]] Client Mode

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When an RPC Environment is initialized core:SparkEnv.md#createDriverEnv[as part of the initialization of the driver] or core:SparkEnv.md#createExecutorEnv[executors] (using RpcEnv.create), clientMode is false for the driver and true for executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Copied (almost verbatim) from https://issues.apache.org/jira/browse/SPARK-10997[SPARK-10997 Netty-based RPC env should support a \"client-only\" mode] and the https://github.com/apache/spark/commit/71d1c907dec446db566b19f912159fd8f46deb7d[commit]:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \"Client mode\" means the RPC env will not listen for incoming connections.

This allows certain processes in the Spark stack (such as Executors or the YARN client-mode AM) to act as pure clients when using the netty-based RPC backend, reducing the number of sockets Spark apps need to use and also the number of open ports.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The AM connects to the driver in \"client mode\", and that connection is used for all driver -- AM communication, and so the AM is properly notified when the connection goes down.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In \"general\", non-YARN case, clientMode flag is therefore enabled for executors and disabled for the driver.

In Spark on YARN in client deploy mode, however, the clientMode flag is enabled explicitly when Spark on YARN's spark-yarn-applicationmaster.md#runExecutorLauncher-sparkYarnAM[ApplicationMaster] creates the sparkYarnAM RPC Environment.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rpc/RpcEnvFactory/","title":"RpcEnvFactory","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    = RpcEnvFactory

RpcEnvFactory is an abstraction of <<implementations, RpcEnv factories>> to <<create, create RpcEnvs>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[implementations]] Available RpcEnvFactories

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    rpc:NettyRpcEnvFactory.md[] is the default and only known RpcEnvFactory in Apache Spark (as of https://github.com/apache/spark/commit/4f5a24d7e73104771f233af041eeba4f41675974[this commit]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[create]] Creating RpcEnv

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rpc/RpcEnvFactory/#sourcescala","title":"[source,scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    create( config: RpcEnvConfig): RpcEnv

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    create is used when RpcEnv utility is requested to rpc:RpcEnv.md#create[create an RpcEnv].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rpc/RpcEnvFileServer/","title":"RpcEnvFileServer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    = RpcEnvFileServer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    RpcEnvFileServer is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"rpc/RpcUtils/","title":"RpcUtils","text":""},{"location":"rpc/RpcUtils/#maximum-message-size","title":"Maximum Message Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxMessageSizeBytes(\n  conf: SparkConf): Int\n

maxMessageSizeBytes is the value of the spark.rpc.message.maxSize configuration property converted to bytes (i.e. the value in MB multiplied by 1024 * 1024).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxMessageSizeBytes throws an IllegalArgumentException when the value is above 2047 MB:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    spark.rpc.message.maxSize should not be greater than 2047 MB\n
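A minimal sketch of the size check and the MB-to-bytes conversion described above (simplified; not the verbatim Spark source):

import org.apache.spark.SparkConf

val conf = new SparkConf()
// spark.rpc.message.maxSize is specified in MB (default: 128)
val maxSizeInMB = conf.getInt("spark.rpc.message.maxSize", 128)
require(maxSizeInMB <= 2047,
  "spark.rpc.message.maxSize should not be greater than 2047 MB")
val maxSizeInBytes = maxSizeInMB * 1024 * 1024  // e.g. 128 MB => 134217728 bytes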

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maxMessageSizeBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MapOutputTrackerMaster is requested for the maxRpcMessageSize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Executor is requested for the maxDirectResultSize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CoarseGrainedSchedulerBackend is requested for the maxRpcMessageSize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"rpc/RpcUtils/#makedriverref","title":"makeDriverRef
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    makeDriverRef(\n  name: String,\n  conf: SparkConf,\n  rpcEnv: RpcEnv): RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    makeDriverRef...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    makeDriverRef is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BarrierTaskContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkEnv utility is used to create a SparkEnv (on executors)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Executor is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • PluginContextImpl is requested for driverEndpoint
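A hedged usage sketch: from an executor-side RpcEnv, look up an endpoint registered on the driver (here the HeartbeatReceiver endpoint, resolved via spark.driver.host and spark.driver.port in the SparkConf). conf and rpcEnv are assumed to be in scope, and the code needs access to Spark's internal (private[spark]) API:

import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.util.RpcUtils

// conf: a SparkConf with spark.driver.host / spark.driver.port set
// rpcEnv: an already-created (client-mode) RpcEnv
val driverRef: RpcEndpointRef = RpcUtils.makeDriverRef("HeartbeatReceiver", conf, rpcEnv)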
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"rpc/spark-rpc-netty/","title":"Netty-Based RpcEnv","text":"

Netty-based RPC Environment is created by NettyRpcEnvFactory when the rpc:index.md#settings[spark.rpc] setting is netty or org.apache.spark.rpc.netty.NettyRpcEnvFactory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NettyRpcEnv is only started on spark-driver.md[the driver]. See <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The default port to listen to is 7077.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When NettyRpcEnv starts, the following INFO message is printed out in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Successfully started service 'NettyRpcEnv' on port 0.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[thread-pools]] Thread Pools

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === shuffle-server-ID

For the Shuffle server, EventLoopGroup uses a daemon thread pool called shuffle-server-ID, where ID is a unique integer. The group is a NioEventLoopGroup (NIO mode) or an EpollEventLoopGroup (EPOLL mode).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    CAUTION: FIXME Review Netty's NioEventLoopGroup.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    CAUTION: FIXME Where are SO_BACKLOG, SO_RCVBUF, SO_SNDBUF channel options used?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === dispatcher-event-loop-ID

NettyRpcEnv's Dispatcher uses a daemon fixed thread pool with spark.rpc.netty.dispatcher.numThreads threads.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Thread names are formatted as dispatcher-event-loop-ID, where ID is a unique, sequentially assigned integer.

The Dispatcher starts the message processing loop on all of these threads.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === netty-rpc-env-timeout

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NettyRpcEnv uses the daemon single-thread scheduled thread pool netty-rpc-env-timeout.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \"netty-rpc-env-timeout\" #87 daemon prio=5 os_prio=31 tid=0x00007f887775a000 nid=0xc503 waiting on condition [0x0000000123397000]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    === netty-rpc-connection-ID

NettyRpcEnv uses a daemon cached thread pool with up to spark.rpc.connect.threads threads.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Thread names are formatted as netty-rpc-connection-ID, where ID is a unique, sequentially assigned integer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[settings]] Settings

The Netty-based implementation uses the following properties (a configuration sketch follows the list):

• spark.rpc.io.mode (default: NIO) - NIO or EPOLL for low-level IO. NIO is always available, while EPOLL is only available on Linux. NIO uses io.netty.channel.nio.NioEventLoopGroup while EPOLL uses io.netty.channel.epoll.EpollEventLoopGroup.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.shuffle.io.numConnectionsPerPeer always equals 1
• spark.rpc.io.threads (default: 0; maximum: 8) - the number of threads to use for the Netty client and server thread pools
• spark.shuffle.io.serverThreads (default: the value of spark.rpc.io.threads)
• spark.shuffle.io.clientThreads (default: the value of spark.rpc.io.threads)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.rpc.netty.dispatcher.numThreads (default: the number of processors available to JVM)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.rpc.connect.threads (default: 64) - used in cluster mode to communicate with a remote RPC endpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.port.maxRetries (default: 16 or 100 for testing when spark.testing is set) controls the maximum number of binding attempts/retries to a port before giving up.
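A minimal sketch, for illustration only, of setting some of these properties on a SparkConf before the SparkContext is created (the values are examples, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.rpc.io.mode", "NIO")        // or EPOLL on Linux
  .set("spark.rpc.io.threads", "8")       // Netty client and server thread pools
  .set("spark.rpc.connect.threads", "64") // outbound RPC connections in cluster mode
  .set("spark.port.maxRetries", "16")     // port binding attempts before giving up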

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == [[endpoints]] Endpoints

• endpoint-verifier (RpcEndpointVerifier) - a rpc:RpcEndpoint.md[RpcEndpoint] that remote RpcEnvs use to query whether an RpcEndpoint exists. It uses the Dispatcher (which keeps track of registered endpoints) and responds true or false to a CheckExistence message.

endpoint-verifier is used to check whether a given endpoint exists before the endpoint's reference is given back to clients.

One use case is when a spark-standalone.md#AppClient[AppClient connects to standalone Masters] before it registers the application it acts for.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    CAUTION: FIXME Who'd like to use endpoint-verifier and how?
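As a sketch (internal Spark API; the rpcEnv value, host, port, and endpoint name are assumptions for illustration), resolving a remote endpoint reference consults the remote endpoint-verifier under the covers:

import org.apache.spark.rpc.RpcAddress

// setupEndpointRef sends a CheckExistence query to the remote endpoint-verifier
// before handing an RpcEndpointRef back to the caller
val masterRef = rpcEnv.setupEndpointRef(RpcAddress("master-host", 7077), "Master")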

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    == Message Dispatcher

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    A message dispatcher is responsible for routing RPC messages to the appropriate endpoint(s).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    It uses the daemon fixed thread pool dispatcher-event-loop with spark.rpc.netty.dispatcher.numThreads threads for dispatching messages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \"dispatcher-event-loop-0\" #26 daemon prio=5 os_prio=31 tid=0x00007f8877153800 nid=0x7103 waiting on condition [0x000000011f78b000]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/","title":"Spark Scheduler","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Spark Scheduler is a core component of Apache Spark that is responsible for scheduling tasks for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Spark Scheduler uses the high-level stage-oriented DAGScheduler and the low-level task-oriented TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/#stage-execution","title":"Stage Execution","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Every partition of a Stage is transformed into a Task (ShuffleMapTask or ResultTask for ShuffleMapStage and ResultStage, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Submitting a stage can therefore trigger execution of a series of dependent parent stages.

When a Spark job is submitted, new stages are created; a stage can be created from scratch or linked to (i.e. shared) if another job already uses it.

DAGScheduler splits up a job into a collection of Stages. A Stage contains a sequence of narrow transformations that can be completed without shuffling the data set, separated at shuffle boundaries (where a shuffle occurs). Stages are thus the result of breaking the RDD graph at shuffle boundaries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Shuffle boundaries introduce a barrier where stages/tasks must wait for the previous stage to finish before they fetch map outputs.
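As a quick illustration (a sketch that assumes a running SparkContext available as sc, e.g. in spark-shell), a single shuffle splits the job below into a ShuffleMapStage and a ResultStage:

// map is a narrow transformation and stays in ShuffleMapStage 0
val pairs = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))

// reduceByKey introduces a shuffle boundary; collect() runs ResultStage 1,
// whose tasks fetch the map outputs produced by stage 0
pairs.reduceByKey(_ + _).collect()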

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Deep Dive into the Apache Spark Scheduler by Xingbo Jiang (Databricks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/ActiveJob/","title":"ActiveJob","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ActiveJob (job, action job) is a top-level work item (computation) submitted to DAGScheduler for execution (usually to compute the result of an RDD action).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Executing a job is equivalent to computing the partitions of the RDD an action has been executed upon. The number of partitions (numPartitions) to compute in a job depends on the type of a stage (ResultStage or ShuffleMapStage).

A job starts with a single target RDD, but can ultimately include other RDDs that are all part of the target RDD's lineage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The parent stages are always ShuffleMapStages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

Not all partitions always have to be computed for ResultStages (e.g. for actions like first() and lookup()).
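For example (a sketch assuming sc is a running SparkContext), first() initially runs a job over a single partition, so the ActiveJob does not have to compute all 10 partitions of the ResultStage:

// only partition 0 is computed first; more partitions are tried only if it is empty
sc.parallelize(1 to 100, numSlices = 10).first()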

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/ActiveJob/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ActiveJob takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Final Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CallSite
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JobListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ActiveJob is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to handleJobSubmitted and handleMapStageSubmitted
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/ActiveJob/#final-stage","title":"Final Stage

ActiveJob is given a Stage when created, and that final stage determines the logical job type:

1. A map-stage job that computes the map output files for a ShuffleMapStage (for submitMapStage) before any downstream stages are submitted
2. A result job that computes a ResultStage to execute an action
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/ActiveJob/#finished-computed-partitions","title":"Finished (Computed) Partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ActiveJob uses finished registry of flags to track partitions that have already been computed (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/BlacklistTracker/","title":"BlacklistTracker","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlacklistTracker is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/CoarseGrainedSchedulerBackend/","title":"CoarseGrainedSchedulerBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CoarseGrainedSchedulerBackend is a base SchedulerBackend for coarse-grained schedulers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CoarseGrainedSchedulerBackend is an ExecutorAllocationClient.

CoarseGrainedSchedulerBackend is responsible for requesting resources from a cluster manager for executors, which it in turn uses to launch tasks (on CoarseGrainedExecutorBackend).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CoarseGrainedSchedulerBackend holds executors for the duration of the Spark job rather than relinquishing executors whenever a task is done and asking the scheduler to launch a new executor for each new task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CoarseGrainedSchedulerBackend registers CoarseGrainedScheduler RPC Endpoint that executors use for RPC communication.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

Active executors are executors that are neither pending removal nor lost.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MesosCoarseGrainedSchedulerBackend (Spark on Mesos)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StandaloneSchedulerBackend (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • YarnSchedulerBackend (Spark on YARN)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CoarseGrainedSchedulerBackend takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RpcEnv"},{"location":"scheduler/CoarseGrainedSchedulerBackend/#driverEndpoint","title":"CoarseGrainedScheduler RPC Endpoint","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        driverEndpoint: RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedSchedulerBackend registers a DriverEndpoint RPC endpoint known as CoarseGrainedScheduler when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#createDriverEndpoint","title":"Creating DriverEndpoint","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createDriverEndpoint(): DriverEndpoint\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createDriverEndpoint creates a new DriverEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Note

The purpose of createDriverEndpoint is to let CoarseGrainedSchedulerBackends provide their own custom implementations (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KubernetesClusterSchedulerBackend (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StandaloneSchedulerBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        createDriverEndpoint is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • CoarseGrainedSchedulerBackend is created (and registers CoarseGrainedScheduler RPC endpoint)
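
The template-method shape of this hook can be illustrated with a short, self-contained sketch. None of the classes below are Spark's (the Sketch names are made up): the point is only that the base backend builds its endpoint while it is being created, and a subclass substitutes a specialized endpoint by overriding the factory method.

// Simplified stand-ins, not Spark's classes
trait DriverEndpointLike
class DriverEndpointSketch extends DriverEndpointLike
class StandaloneDriverEndpointSketch extends DriverEndpointLike

class BaseBackendSketch {
  // subclasses may override this factory to provide a specialized endpoint
  protected def createDriverEndpoint(): DriverEndpointLike = new DriverEndpointSketch

  // built while the backend itself is being created
  // (mirrors registering the CoarseGrainedScheduler RPC endpoint)
  val driverEndpoint: DriverEndpointLike = createDriverEndpoint()
}

class StandaloneLikeBackendSketch extends BaseBackendSketch {
  override protected def createDriverEndpoint(): DriverEndpointLike =
    new StandaloneDriverEndpointSketch
}

// new StandaloneLikeBackendSketch().driverEndpoint is a StandaloneDriverEndpointSketch
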
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks","text":"SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxNumConcurrentTasks is part of the SchedulerBackend abstraction.

maxNumConcurrentTasks uses the Available Executors registry to find out the ResourceProfile, the total number of CPU cores and the ExecutorResourceInfos of every active executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, maxNumConcurrentTasks calculates the available (parallel) slots for the given ResourceProfile (and given the available executor resources).
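
The slot arithmetic can be sketched in a few REPL-style lines. This is a simplification, not Spark's implementation: it assumes every active executor runs under the given ResourceProfile and that only CPU cores limit parallelism (cpusPerTask stands in for the per-task CPU requirement).

// Minimal sketch of the available-slots calculation (CPU-only, single resource profile)
final case class ExecutorCoresSketch(totalCores: Int)

def maxNumConcurrentTasksSketch(
    executors: Seq[ExecutorCoresSketch],
    cpusPerTask: Int): Int =
  executors.map(_.totalCores / cpusPerTask).sum

// two 8-core executors with 2 CPUs per task give 8 parallel slots
assert(maxNumConcurrentTasksSketch(Seq(ExecutorCoresSketch(8), ExecutorCoresSketch(8)), cpusPerTask = 2) == 8)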

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/CoarseGrainedSchedulerBackend/#totalregisteredexecutors-registry","title":"totalRegisteredExecutors Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        totalRegisteredExecutors: AtomicInteger\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        totalRegisteredExecutors is an internal registry of the number of registered executors (a Java AtomicInteger).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        totalRegisteredExecutors starts from 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        totalRegisteredExecutors is incremented when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverEndpoint is requested to handle a RegisterExecutor message

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        totalRegisteredExecutors is decremented when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DriverEndpoint is requested to remove an executor
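
Since the registry is a plain java.util.concurrent.atomic.AtomicInteger, its life cycle boils down to the two calls below (REPL-style illustration only):

import java.util.concurrent.atomic.AtomicInteger

val totalRegisteredExecutors = new AtomicInteger(0)
totalRegisteredExecutors.incrementAndGet() // an executor registered (RegisterExecutor)
totalRegisteredExecutors.decrementAndGet() // an executor was removed
assert(totalRegisteredExecutors.get() == 0)
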
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#sufficient-resources-registered","title":"Sufficient Resources Registered
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        sufficientResourcesRegistered(): Boolean\n

sufficientResourcesRegistered is true by default (and is supposed to be overridden by custom CoarseGrainedSchedulerBackends).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#minimum-resources-available-ratio","title":"Minimum Resources Available Ratio
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        minRegisteredRatio: Double\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        minRegisteredRatio is a ratio of the minimum resources available to the total expected resources for the CoarseGrainedSchedulerBackend to be ready for scheduling tasks (for execution).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        minRegisteredRatio uses spark.scheduler.minRegisteredResourcesRatio configuration property if defined or defaults to 0.0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        minRegisteredRatio can be between 0.0 and 1.0 (inclusive).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        minRegisteredRatio is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • CoarseGrainedSchedulerBackend is requested to isReady
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StandaloneSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KubernetesClusterSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MesosCoarseGrainedSchedulerBackend is requested to sufficientResourcesRegistered
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • YarnSchedulerBackend is requested to sufficientResourcesRegistered
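
Custom backends typically compare the number of registered executors against the total they expect, scaled by minRegisteredRatio. A minimal REPL-style sketch (totalExpectedExecutors is a made-up name; the real backends track their own expected totals):

// sufficient once the registered count reaches the expected total times the ratio
def sufficientResourcesRegisteredSketch(
    totalRegisteredExecutors: Int,
    totalExpectedExecutors: Int,
    minRegisteredRatio: Double): Boolean =
  totalRegisteredExecutors >= totalExpectedExecutors * minRegisteredRatio

// with spark.scheduler.minRegisteredResourcesRatio=0.8 and 10 expected executors
assert(sufficientResourcesRegisteredSketch(8, 10, 0.8))
assert(!sufficientResourcesRegisteredSketch(7, 10, 0.8))
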
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#available-executors-registry","title":"Available Executors Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        executorDataMap: HashMap[String, ExecutorData]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedSchedulerBackend tracks available executors using executorDataMap registry (of ExecutorDatas by executor id).

A new entry is added when DriverEndpoint is requested to handle a RegisterExecutor message.

An entry is removed when DriverEndpoint is requested to handle a RemoveExecutor message or a remote host (with one or more executors) disconnects.
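
That add/remove life cycle can be illustrated with a REPL-style sketch. ExecutorDataSketch is a simplified stand-in (the real ExecutorData is internal to Spark); only the keyed-by-executor-id structure matters here:

import scala.collection.mutable

final case class ExecutorDataSketch(executorHost: String, totalCores: Int, freeCores: Int)

val executorDataMap = mutable.HashMap.empty[String, ExecutorDataSketch]
executorDataMap("0") = ExecutorDataSketch("host-a", totalCores = 8, freeCores = 8) // RegisterExecutor
executorDataMap -= "0"                                                             // RemoveExecutor or host lost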

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#revive-messages-scheduler-service","title":"Revive Messages Scheduler Service
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        reviveThread: ScheduledExecutorService\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CoarseGrainedSchedulerBackend creates a Java ScheduledExecutorService when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The ScheduledExecutorService is used by DriverEndpoint RPC Endpoint to post ReviveOffers messages regularly.
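
The mechanism is a stock single-threaded scheduler firing at a fixed interval (Spark takes the interval from the spark.scheduler.revive.interval configuration property). A REPL-style sketch of that wiring, with the actual message-posting reduced to a comment:

import java.util.concurrent.{Executors, TimeUnit}

val reviveThread = Executors.newSingleThreadScheduledExecutor()
reviveThread.scheduleAtFixedRate(
  new Runnable {
    def run(): Unit = { /* post a ReviveOffers message to the DriverEndpoint */ }
  },
  0, 1, TimeUnit.SECONDS)
// ...and on shutdown
reviveThread.shutdown()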

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#maximum-size-of-rpc-message","title":"Maximum Size of RPC Message

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        maxRpcMessageSize is the value of spark.rpc.message.maxSize configuration property.
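
The property is set like any other Spark setting and is expressed in MiB (128 by default), e.g.:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.rpc.message.maxSize", "256") // 256 MiB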

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#making-fake-resource-offers-on-executors","title":"Making Fake Resource Offers on Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        makeOffers(): Unit\nmakeOffers(\n  executorId: String): Unit\n

makeOffers takes the active executors (out of the executorDataMap internal registry) and creates WorkerOffer resource offers for each (one per executor with the executor's id, host and free cores).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CAUTION: Only free cores are considered in making offers. Memory is not! Why?!

It then requests TaskSchedulerImpl to process the resource offers, which produces a collection of TaskDescription collections that makeOffers in turn uses to launch tasks.
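
A minimal sketch of the offer-building step. WorkerOfferSketch and the plain map below are simplified stand-ins (the real WorkerOffer and ExecutorData are internal to Spark); the point is one offer per active executor carrying only its id, host and currently free cores:

final case class WorkerOfferSketch(executorId: String, host: String, freeCores: Int)

def makeOffersSketch(
    activeExecutors: Map[String, (String, Int)] // executor id -> (host, free cores)
): Seq[WorkerOfferSketch] =
  activeExecutors.toSeq.map { case (id, (host, freeCores)) =>
    WorkerOfferSketch(id, host, freeCores)
  }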

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#getting-executor-ids","title":"Getting Executor Ids

When called, getExecutorIds simply returns the executor ids from the internal executorDataMap registry.

NOTE: It is called when SparkContext calculates executor ids.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-executors","title":"Requesting Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestExecutors(\n  numAdditionalExecutors: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestExecutors is a \"decorator\" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestExecutors method is part of the ExecutorAllocationClient abstraction.

When called, you should see the following INFO message followed by a DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Requesting [numAdditionalExecutors] additional executor(s) from the cluster manager\nNumber of pending executors is now [numPendingExecutors]\n

numPendingExecutors is increased by the input numAdditionalExecutors.

requestExecutors then requests executors from the cluster manager (to reflect the current computation needs). The \"new executor total\" is the sum of the currently registered and pending executors decreased by the number of executors pending removal.

If numAdditionalExecutors is negative, an IllegalArgumentException is thrown:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Attempted to request a negative number of additional executor(s) [numAdditionalExecutors] from the cluster manager. Please specify a positive number!\n

NOTE: requestExecutors is a final method that scheduler backends cannot customize further.

NOTE: The method uses a synchronized block so that multiple concurrent requests are handled serially, i.e. one by one.
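
On the user-facing side, SparkContext exposes a Developer API counterpart (also called requestExecutors) that ends up here. A usage sketch, assuming a coarse-grained cluster manager (backends that do not support dynamic executor requests simply return false):

import org.apache.spark.SparkContext

// ask the cluster manager for two more executors; the Boolean says whether
// the request was acknowledged
def addTwoExecutors(sc: SparkContext): Boolean =
  sc.requestExecutors(2)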

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-exact-number-of-executors","title":"Requesting Exact Number of Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestTotalExecutors(\n  numExecutors: Int,\n  localityAwareTasks: Int,\n  hostToLocalTaskCount: Map[String, Int]): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestTotalExecutors is a \"decorator\" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        requestTotalExecutors is part of the ExecutorAllocationClient abstraction.

It sets the internal localityAwareTasks and hostToLocalTaskCount registries (from the input arguments). It then calculates the exact number of executors, which is the input numExecutors and the number of executors pending removal decreased by the number of already-registered executors.

If numExecutors is negative, an IllegalArgumentException is thrown:

```text
Attempted to request a negative number of executor(s) [numExecutors] from the cluster manager. Please specify a positive number!
```

NOTE: It is a final method that no scheduler backend can customize further.

NOTE: The method uses a synchronized block so that multiple concurrent requests are handled serially, i.e. one by one.
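The overall shape of the method can be summarized with the following sketch. It is a minimal illustration only: apart from doRequestTotalExecutors, the field names and the await timeout are assumptions based on the description above, not Spark's actual internals.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// A minimal sketch of the "decorator" shape described above.
// Field names and the timeout are assumptions; only doRequestTotalExecutors
// (the cluster-specific part, "not acknowledged" by default) comes from the text.
trait TotalExecutorsRequester {
  protected def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] =
    Future.successful(false)

  protected var localityAwareTasks: Int = 0
  protected var hostToLocalTaskCount: Map[String, Int] = Map.empty

  final def requestTotalExecutors(
      numExecutors: Int,
      localityAwareTasks: Int,
      hostToLocalTaskCount: Map[String, Int]): Boolean = synchronized {
    // A negative request is rejected with an IllegalArgumentException.
    require(numExecutors >= 0,
      s"Attempted to request a negative number of executor(s) $numExecutors from the " +
        "cluster manager. Please specify a positive number!")
    this.localityAwareTasks = localityAwareTasks
    this.hostToLocalTaskCount = hostToLocalTaskCount
    // Block for the cluster manager's acknowledgement.
    Await.result(doRequestTotalExecutors(numExecutors), 30.seconds)
  }
}
```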

## Finding Default Level of Parallelism
```scala
defaultParallelism(): Int
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        defaultParallelism is part of the SchedulerBackend abstraction.

defaultParallelism is the value of the spark.default.parallelism configuration property, if defined.

Otherwise, defaultParallelism is the maximum of totalCoreCount and 2.
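In code, the rule boils down to something like the following sketch (assuming `conf` is the SparkConf and `totalCoreCount` the internal counter of cores across registered executors; both names are assumptions):

```scala
// A minimal sketch of the rule above; `conf` and `totalCoreCount` are assumed to be
// the SparkConf and an internal java.util.concurrent.atomic.AtomicInteger of
// total registered cores, respectively.
override def defaultParallelism(): Int =
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
```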

## Killing Task
```scala
killTask(
  taskId: Long,
  executorId: String,
  interruptThread: Boolean): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        killTask is part of the SchedulerBackend abstraction.

killTask simply sends a KillTask message to <>.
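As a sketch of a method body only (assuming `driverEndpoint` is the CoarseGrainedScheduler RPC endpoint reference and `KillTask` the corresponding RPC message, both taken from the surrounding text), the method is just a message send:

```scala
// A minimal sketch; `driverEndpoint` and `KillTask` are assumptions from the text above,
// not a verbatim copy of Spark's sources.
override def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
  driverEndpoint.send(KillTask(taskId, executorId, interruptThread))
```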

## Stopping All Executors

stopExecutors sends a blocking <> message to <> (if already initialized).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: It is called exclusively while CoarseGrainedSchedulerBackend is <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You should see the following INFO message in the logs:

```text
Shutting down all executors
```

## Reset State

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        reset resets the internal state:

1. Sets <> to 0
2. Clears executorsPendingToRemove
3. Sends a blocking <> message to <> for every executor (in the internal executorDataMap) to inform it about SlaveLost with the message:

    ```text
    Stale executor after cluster manager re-registered.
    ```

reset is a method that is defined in CoarseGrainedSchedulerBackend, but used and overridden exclusively by yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].
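Put together, reset looks roughly like the following sketch. The registry names and the SlaveLost reason come from the list above; everything else (including `requestedTotalExecutors` and the `removeExecutor` call) is an assumption, not guaranteed to match Spark's sources.

```scala
// A minimal sketch of the reset steps listed above; `requestedTotalExecutors`,
// `executorsPendingToRemove`, `executorDataMap`, `removeExecutor` and `SlaveLost`
// are assumed from the surrounding text.
protected def reset(): Unit = {
  val executors = synchronized {
    requestedTotalExecutors = 0
    executorsPendingToRemove.clear()
    executorDataMap.keys.toSet
  }
  // Inform the driver endpoint about every known executor being lost.
  executors.foreach { executorId =>
    removeExecutor(executorId,
      SlaveLost("Stale executor after cluster manager re-registered."))
  }
}
```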

## Remove Executor
```scala
removeExecutor(executorId: String, reason: ExecutorLossReason)
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          removeExecutor sends a blocking <> message to <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: It is called by subclasses spark-standalone.md#SparkDeploySchedulerBackend[SparkDeploySchedulerBackend], spark-mesos/spark-mesos.md#CoarseMesosSchedulerBackend[CoarseMesosSchedulerBackend], and yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].

## CoarseGrainedScheduler RPC Endpoint

When <>, it registers the CoarseGrainedScheduler RPC endpoint to be the driver's communication endpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          driverEndpoint is a DriverEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

CoarseGrainedSchedulerBackend is created while SparkContext is being created, which in turn lives inside a Spark driver. That explains the name driverEndpoint (at least partially).

Internally, it is referred to as the standalone scheduler's driver endpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It tracks:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It uses driver-revive-thread daemon single-thread thread pool for ...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CAUTION: FIXME A potential issue with driverEndpoint.asInstanceOf[NettyRpcEndpointRef].toURI - doubles spark:// prefix.

## Starting CoarseGrainedSchedulerBackend
```scala
start(): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          start is part of the SchedulerBackend abstraction.

start takes all spark.-prefixed properties and registers the CoarseGrainedScheduler RPC endpoint (backed by DriverEndpoint, a ThreadSafeRpcEndpoint).
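A rough sketch of that step follows. It assumes `scheduler` gives access to the current SparkContext (and hence SparkConf) and that `createDriverEndpointRef` is the helper described later on this page; both are assumptions layered on the description above.

```scala
import scala.collection.mutable.ArrayBuffer

// A minimal sketch of start; `scheduler`, `driverEndpoint` and `createDriverEndpointRef`
// are assumed from the surrounding text.
override def start(): Unit = {
  // Collect all spark.-prefixed properties.
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll if key.startsWith("spark.")) {
    properties += ((key, value))
  }
  // Register the CoarseGrainedScheduler RPC endpoint (a DriverEndpoint).
  driverEndpoint = createDriverEndpointRef(properties)
}
```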

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: start uses <> to access the current SparkContext.md[SparkContext] and in turn SparkConf.md[SparkConf].

NOTE: start uses <> that was given when CoarseGrainedSchedulerBackend was created.

## Checking If Sufficient Compute Resources Available Or Waiting Time Passed

```scala
isReady(): Boolean
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          isReady is part of the SchedulerBackend abstraction.

isReady allows delaying task launching until <> or <> passes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Internally, isReady <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: <> by default responds that sufficient resources are available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If the <>, you should see the following INFO message in the logs and isReady is positive.

```text
SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: [minRegisteredRatio]
```

If sufficient resources are not yet available (i.e. the above requirement does not hold), isReady checks whether the time since <> has exceeded <>, as a way to allow launching tasks even though <> has not been reached yet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          You should see the following INFO message in the logs and isReady is positive.

```text
SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: [maxRegisteredWaitingTimeMs](ms)
```

Otherwise, when <> and <> has not elapsed, isReady is negative.
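The whole decision can be summarized with the following sketch. Apart from the two log messages (quoted above), every name here (`sufficientResourcesRegistered`, `minRegisteredRatio`, `createTime`, `maxRegisteredWaitingTimeMs`, the `logInfo` helper from a mixed-in Logging trait) is an assumption derived from the surrounding text.

```scala
// A minimal sketch of the isReady logic described above; all names other than isReady
// are assumptions, not a verbatim copy of Spark's sources.
override def isReady(): Boolean = {
  if (sufficientResourcesRegistered()) {
    logInfo("SchedulerBackend is ready for scheduling beginning after " +
      s"reached minRegisteredResourcesRatio: $minRegisteredRatio")
    return true
  }
  if ((System.currentTimeMillis() - createTime) >= maxRegisteredWaitingTimeMs) {
    logInfo("SchedulerBackend is ready for scheduling beginning after waiting " +
      s"maxRegisteredResourcesWaitingTime: $maxRegisteredWaitingTimeMs(ms)")
    return true
  }
  false
}
```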

## Reviving Resource Offers

```scala
reviveOffers(): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          reviveOffers is part of the SchedulerBackend abstraction.

reviveOffers simply sends a ReviveOffers message to the CoarseGrainedSchedulerBackend RPC endpoint.

## Stopping SchedulerBackend
```scala
stop(): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stop is part of the SchedulerBackend abstraction.

stop <> and stops the CoarseGrainedScheduler RPC endpoint (by sending it a blocking StopDriver message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In case of any Exception, stop reports a SparkException with the message:

```text
Error stopping standalone scheduler's driver endpoint
```

## createDriverEndpointRef
```scala
createDriverEndpointRef(
  properties: ArrayBuffer[(String, String)]): RpcEndpointRef
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createDriverEndpointRef <> and rpc:index.md#setupEndpoint[registers it] as CoarseGrainedScheduler.
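As a sketch (assuming an `rpcEnv` reference and a `createDriverEndpoint` factory, neither of which is confirmed by the text above), the helper could look like this:

```scala
// A minimal sketch; `rpcEnv` and `createDriverEndpoint` are assumptions.
// The endpoint is registered under the name "CoarseGrainedScheduler", per the text above.
protected def createDriverEndpointRef(
    properties: ArrayBuffer[(String, String)]): RpcEndpointRef = {
  rpcEnv.setupEndpoint("CoarseGrainedScheduler", createDriverEndpoint(properties))
}
```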

createDriverEndpointRef is used when CoarseGrainedSchedulerBackend is requested to <>.

## Checking Whether Executor is Active

```scala
isExecutorActive(
  id: String): Boolean
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          isExecutorActive is part of the ExecutorAllocationClient abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          isExecutorActive...FIXME

## Requesting Executors from Cluster Manager
```scala
doRequestTotalExecutors(
  requestedTotal: Int): Future[Boolean]
```

doRequestTotalExecutors returns a Future that is already completed with false.
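In other words, the default (base) behaviour is simply the following sketch; the cluster-specific backends mentioned earlier override it.

```scala
import scala.concurrent.Future

// The default behaviour described above: a Future already completed with `false`
// (request not acknowledged). Cluster-specific backends override this method.
protected def doRequestTotalExecutors(requestedTotal: Int): Future[Boolean] =
  Future.successful(false)
```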

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          doRequestTotalExecutors is used when:

* CoarseGrainedSchedulerBackend is requested to requestExecutors, requestTotalExecutors and killExecutors

## Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

```text
log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend=ALL
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Refer to Logging.

# CompressedMapStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CompressedMapStatus is...FIXME

# DAGScheduler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

The introduction that follows was highly influenced by the scaladoc of org.apache.spark.scheduler.DAGScheduler. As DAGScheduler is a private class, it does not appear in the official API documentation. You are strongly encouraged to read the sources first and only then this and the related pages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/DAGScheduler/#introduction","title":"Introduction","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling using Jobs and Stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler transforms a logical execution plan (RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

After an action has been called on an RDD, SparkContext hands over a logical plan to DAGScheduler, which in turn translates it into a set of stages that are submitted as TaskSets for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler works solely on the driver and is created as part of SparkContext's initialization (right after TaskScheduler and SchedulerBackend are ready).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler does three things in Spark:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Computes an execution DAG (DAG of stages) for a job
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Determines the preferred locations to run each task on
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Handles failures due to shuffle output files being lost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler computes a directed acyclic graph (DAG) of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run jobs. It then submits stages to TaskScheduler.
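
For illustration, a minimal (hypothetical) spark-shell snippet such as the one below produces a single job with two stages, because reduceByKey introduces a shuffle boundary. DAGScheduler submits the ShuffleMapStage first and, once its shuffle output is available, the ResultStage.

// Hypothetical spark-shell session (sc is provided by the shell).
// reduceByKey introduces a shuffle, so the collect() action below triggers
// one job with two stages: a ShuffleMapStage (parallelize + map) and
// a ResultStage (reduceByKey + collect).
val pairs = sc.parallelize(1 to 100).map(n => (n % 10, n))
val sums  = pairs.reduceByKey(_ + _).collect()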

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In addition to coming up with the execution DAG, DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes the information to TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler tracks which RDDs are cached (or persisted) to avoid \"recomputing\" them (re-doing the map side of a shuffle). DAGScheduler remembers what ShuffleMapStages have already produced output files (that are stored in BlockManagers).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler is only interested in cache location coordinates (i.e. host and executor id, per partition of a RDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Furthermore, DAGScheduler handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler itself, which will retry each task a small number of times before cancelling the whole stage.

DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events (e.g. a new job or stage being submitted) that DAGScheduler reads and processes sequentially. See the section Event Bus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler runs stages in topological order.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler uses SparkContext, TaskScheduler, LiveListenerBus, MapOutputTracker and BlockManager for its services. However, at the very minimum, DAGScheduler takes a SparkContext only (and requests SparkContext for the other services).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          When DAGScheduler schedules a job as a result of executing an action on a RDD or calling SparkContext.runJob directly, it spawns parallel tasks to compute (partial) results per partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/DAGScheduler/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          DAGScheduler takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MapOutputTrackerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockManagerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Clock

DAGScheduler is created when SparkContext is created.

While being created, DAGScheduler requests the TaskScheduler to associate with it and requests the DAGScheduler Event Bus to start accepting events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#submitMapStage","title":"Submitting MapStage for Execution (Posting MapStageSubmitted)","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMapStage[K, V, C](\n  dependency: ShuffleDependency[K, V, C],\n  callback: MapOutputStatistics => Unit,\n  callSite: CallSite,\n  properties: Properties): JobWaiter[MapOutputStatistics]\n

submitMapStage requests the given ShuffleDependency for its RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMapStage gets the job ID and increments it (for future submissions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMapStage creates a JobWaiter to wait for a MapOutputStatistics. The JobWaiter waits for 1 task and, when completed successfully, executes the given callback function with the computed MapOutputStatistics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, submitMapStage posts a MapStageSubmitted and returns the JobWaiter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to submit a MapStage for execution
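
The waiter-plus-callback behaviour described above can be pictured with the following simplified sketch. It is an illustration only (the class and its members are made up, and R stands in for MapOutputStatistics), not Spark's JobWaiter.

import scala.concurrent.{Future, Promise}

// Toy illustration of "wait for a single task, then invoke the callback".
// Not Spark's JobWaiter; R stands in for MapOutputStatistics.
final class SingleResultWaiter[R](callback: R => Unit) {
  private val promise = Promise[R]()

  def taskSucceeded(result: R): Unit = {
    callback(result)           // hand the computed result to the caller
    promise.trySuccess(result) // mark the waiter as completed
  }

  def jobFailed(cause: Exception): Unit =
    promise.tryFailure(cause)

  def completionFuture: Future[R] = promise.future
}
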
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#metricsSource","title":"DAGSchedulerSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler uses DAGSchedulerSource for performance metrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#eventProcessLoop","title":"DAGScheduler Event Bus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler uses an event bus to process scheduling events on a separate thread (one by one and asynchronously).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler requests the event bus to start right when created and stops it when requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler defines event-posting methods for posting DAGSchedulerEvent events to the event bus.
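
The pattern can be sketched with a simplified single-threaded event loop like the one below. The sketch is an illustration only (all names are made up), not Spark's DAGSchedulerEventProcessLoop.

import java.util.concurrent.LinkedBlockingQueue

// Simplified single-threaded event loop: callers post events, one daemon
// thread takes them off the queue and handles them sequentially.
final class SimpleEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    override def run(): Unit =
      try {
        while (!stopped) handle(queue.take())
      } catch {
        case _: InterruptedException => // stop() interrupts a blocked take()
      }
  }
  eventThread.setDaemon(true)

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
}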

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#taskScheduler","title":"TaskScheduler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler is given a TaskScheduler when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskScheduler is used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Submitting missing tasks of a stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Handling task completion (CompletionEvent)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Killing a task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Failing a job and all other independent single-job stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stopping itself
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#runJob","title":"Running Job
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runJob[T, U](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int],\n  callSite: CallSite,\n  resultHandler: (Int, U) => Unit,\n  properties: Properties): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runJob submits a job and waits until a result is available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runJob prints out the following INFO message to the logs when the job has finished successfully:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Job [jobId] finished: [callSite], took [time] s\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runJob prints out the following INFO message to the logs when the job has failed:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Job [jobId] failed: [callSite], took [time] s\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runJob is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to run a job
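
DAGScheduler.runJob itself is internal; from user code the call typically originates in an RDD action or SparkContext.runJob. A minimal (hypothetical) spark-shell example:

// Hypothetical spark-shell session (sc is provided by the shell).
// SparkContext.runJob is the public entry point that ends up in DAGScheduler.runJob.
val rdd = sc.parallelize(1 to 10, numSlices = 2)

// One value per partition; runJob blocks until every result is available.
val perPartitionSums: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.sum)
// perPartitionSums: Array[Int] = Array(15, 40)
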
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#submitJob","title":"Submitting Job
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitJob[T, U](\n  rdd: RDD[T],\n  func: (TaskContext, Iterator[T]) => U,\n  partitions: Seq[Int],\n  callSite: CallSite,\n  resultHandler: (Int, U) => Unit,\n  properties: Properties): JobWaiter[U]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitJob increments the nextJobId internal counter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitJob creates a JobWaiter for the (number of) partitions and the given resultHandler function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitJob requests the DAGSchedulerEventProcessLoop to post a JobSubmitted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, submitJob returns the JobWaiter.

When there are no partitions to compute (the given partitions are empty), submitJob requests the LiveListenerBus to post SparkListenerJobStart and SparkListenerJobEnd (with the JobSucceeded result marker) events and returns a JobWaiter with no tasks to wait for.

submitJob throws an IllegalArgumentException when the partition indices are not among the partitions of the given RDD:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Attempting to access a non-existent partition: [p]. Total number of partitions: [maxPartitions]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitJob is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to submit a job
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to run a job
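
From user code, the non-blocking counterpart is SparkContext.submitJob, which eventually reaches DAGScheduler.submitJob and returns a future-like handle instead of blocking. A minimal (hypothetical) spark-shell example:

// Hypothetical spark-shell session (sc is provided by the shell).
import scala.collection.mutable

val rdd = sc.parallelize(1 to 10, numSlices = 2)
val partialSums = mutable.Map.empty[Int, Int]

val futureAction = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.sum,                       // computed per partition
  Seq(0, 1),                                           // partitions to compute
  (index: Int, sum: Int) => partialSums(index) = sum,  // resultHandler
  partialSums.toMap)                                   // resultFunc (overall result)

// The returned handle completes once both partitions have been handled.
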
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#cacheLocs","title":"Partition Placement Preferences

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler keeps track of block locations per RDD and partition.

DAGScheduler uses TaskLocation, which includes a host name and an executor id on that host (as ExecutorCacheTaskLocation).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The keys are RDDs (their ids) and the values are arrays indexed by partition numbers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Each entry is a set of block locations where a RDD partition is cached, i.e. the BlockManagers of the blocks.

The registry is initialized empty when DAGScheduler is created.

The registry is used when DAGScheduler is requested for the locations of the cache blocks of a RDD.
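
As a rough sketch of that layout (the types and values below are assumptions for illustration, with plain strings standing in for TaskLocations):

import scala.collection.mutable

// Assumed shape of the registry: RDD id -> per-partition locations where that
// partition is cached. Plain strings stand in for TaskLocation values.
val cacheLocs = mutable.HashMap.empty[Int, IndexedSeq[Seq[String]]]

// RDD 3 has two partitions: partition 0 is cached on two executors,
// partition 1 is not cached anywhere.
cacheLocs(3) = IndexedSeq(
  Seq("executor_host-a_1", "executor_host-b_7"),
  Seq.empty)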

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#activeJobs","title":"ActiveJobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler tracks ActiveJobs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Adds a new ActiveJob when requested to handle JobSubmitted or MapStageSubmitted events

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Removes an ActiveJob when requested to clean up after an ActiveJob and independent stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Removes all ActiveJobs when requested to doCancelAllJobs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DAGScheduler uses ActiveJobs registry when requested to handle JobGroupCancelled or TaskCompletion events, to cleanUpAfterSchedulerStop and to abort a stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The number of ActiveJobs is available using job.activeJobs performance metric.
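A hedged sketch of the registry's lifecycle, assuming a simple mutable set (the actual field layout may differ):

import scala.collection.mutable

val activeJobs = new mutable.HashSet[ActiveJob]

activeJobs += job      // a JobSubmitted or MapStageSubmitted event is handled
activeJobs -= job      // cleaning up after an ActiveJob and independent stages
activeJobs.clear()     // doCancelAllJobs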

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#createResultStage","title":"Creating ResultStage for RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage(\n  rdd: RDD[_],\n  func: (TaskContext, Iterator[_]) => _,\n  partitions: Array[Int],\n  jobId: Int,\n  callSite: CallSite): ResultStage\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage creates a new ResultStage for the ShuffleDependencies and ResourceProfiles of the given RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage finds the ShuffleDependencies and ResourceProfiles for the given RDD.

createResultStage merges the ResourceProfiles for the Stage (when merging is enabled) or reports an exception (when there are conflicting ResourceProfiles and merging is disabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage does the following checks (that may report violations and break the execution):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • checkBarrierStageWithDynamicAllocation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • checkBarrierStageWithNumSlots
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • checkBarrierStageWithRDDChainPattern

createResultStage getOrCreateParentStages (with the ShuffleDependencies and the given jobId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage uses the nextStageId counter for a stage ID.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage creates a new ResultStage (with the unique id of a ResourceProfile among others).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage registers the ResultStage with the stage ID in stageIdToStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage updateJobIdStageIdMaps and returns the ResultStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createResultStage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handle a JobSubmitted event
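The flow above can be summarized with the following hedged sketch (simplified; helper names follow the steps described in this section and exact signatures may differ between Spark versions):

private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  // 1. Find the ShuffleDependencies and ResourceProfiles of the RDD
  val (shuffleDeps, resourceProfiles) = getShuffleDependenciesAndResourceProfiles(rdd)
  // 2. Merge the ResourceProfiles (or report an exception)
  val resourceProfile = mergeResourceProfilesForStage(resourceProfiles)
  // 3. Barrier-stage checks that may break the execution
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd, resourceProfile)
  checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
  // 4. Parent stages, a new stage ID and the ResultStage itself
  val parents = getOrCreateParentStages(shuffleDeps, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId,
    callSite, resourceProfile.id)
  // 5. Register the stage and return it
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}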
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#createShuffleMapStage","title":"Creating ShuffleMapStage for ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage(\n  shuffleDep: ShuffleDependency[_, _, _],\n  jobId: Int): ShuffleMapStage\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage creates a ShuffleMapStage for the given ShuffleDependency as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stage ID is generated based on nextStageId internal counter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD is taken from the given ShuffleDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Number of tasks is the number of partitions of the RDD

• Parent stages (found or created for the RDD of the ShuffleDependency)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapOutputTrackerMaster

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage registers the ShuffleMapStage in the stageIdToStage and shuffleIdToMapStage internal registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage updateJobIdStageIdMaps.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage requests the MapOutputTrackerMaster to check whether it contains the shuffle ID or not.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If not, createShuffleMapStage prints out the following INFO message to the logs and requests the MapOutputTrackerMaster to register the shuffle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Registering RDD [id] ([creationSite]) as input to shuffle [shuffleId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            createShuffleMapStage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to find or create a ShuffleMapStage for a given ShuffleDependency
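A hedged sketch of the flow, assuming ShuffleMapStage is created with the inputs listed above (constructor parameters simplified; the exact ones differ across Spark versions):

def createShuffleMapStage[K, V, C](
    shuffleDep: ShuffleDependency[K, V, C],
    jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length           // one task per partition
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)

  if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    logInfo(s"Registering RDD ${rdd.id} (${rdd.getCreationSite}) " +
      s"as input to shuffle ${shuffleDep.shuffleId}")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}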
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#cleanupStateForJobAndIndependentStages","title":"Cleaning Up After Job and Independent Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanupStateForJobAndIndependentStages(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanupStateForJobAndIndependentStages cleans up the state for job and any stages that are not part of any other job.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanupStateForJobAndIndependentStages looks the job up in the internal jobIdToStageIds registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If no stages are found, the following ERROR is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            No stages registered for job [jobId]\n

Otherwise, cleanupStateForJobAndIndependentStages uses the stageIdToStage registry to find the stages (the actual Stage objects, not just their ids).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For each stage, cleanupStateForJobAndIndependentStages reads the jobs the stage belongs to.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the job does not belong to the jobs of the stage, the following ERROR is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Job [jobId] not registered for stage [stageId] even though that stage was registered for the job\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the job was the only job for the stage, the stage (and the stage id) gets cleaned up from the registries, i.e. runningStages, shuffleIdToMapStage, waitingStages, failedStages and stageIdToStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            While removing from runningStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Removing running stage [stageId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            While removing from waitingStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Removing stage [stageId] from waiting set.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            While removing from failedStages, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Removing stage [stageId] from failed set.\n

After all cleaning (using stageIdToStage as the source registry), if this job was the only job the stage belonged to, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            After removal of stage [stageId], remaining stages = [stageIdToStage.size]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The job is removed from jobIdToStageIds, jobIdToActiveJob, activeJobs registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The final stage of the job is removed, i.e. ResultStage or ShuffleMapStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanupStateForJobAndIndependentStages is used in handleTaskCompletion when a ResultTask has completed successfully, failJobAndIndependentStages and markMapStageJobAsFinished.
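A hedged, heavily-simplified sketch of the cleanup described above (registry names as in this section; the DEBUG messages and some error handling are omitted):

jobIdToStageIds.get(job.jobId) match {
  case None =>
    logError(s"No stages registered for job ${job.jobId}")
  case Some(stageIds) =>
    stageIds.map(stageIdToStage).foreach { stage =>
      stage.jobIds -= job.jobId
      if (stage.jobIds.isEmpty) {             // this job was the only one for the stage
        runningStages -= stage
        waitingStages -= stage
        failedStages -= stage
        shuffleIdToMapStage.retain { case (_, s) => s ne stage }
        stageIdToStage -= stage.id
      }
    }
}
jobIdToStageIds -= job.jobId
jobIdToActiveJob -= job.jobId
activeJobs -= job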

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#markMapStageJobAsFinished","title":"Marking ShuffleMapStage Job Finished
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished(\n  job: ActiveJob,\n  stats: MapOutputStatistics): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished marks the given ActiveJob finished and posts a SparkListenerJobEnd.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished requests the given ActiveJob to turn on (true) the 0th bit in the finished partitions registry and increase the number of tasks finished.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished requests the given ActiveJob for the JobListener that is requested to taskSucceeded (with the 0th index and the given MapOutputStatistics).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished cleanupStateForJobAndIndependentStages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, markMapStageJobAsFinished requests the LiveListenerBus to post a SparkListenerJobEnd.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobAsFinished is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handleMapStageSubmitted and markMapStageJobsAsFinished
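A hedged sketch of the steps above (field and method names follow this section; the SparkListenerJobEnd parameters are an assumption):

def markMapStageJobAsFinished(job: ActiveJob, stats: MapOutputStatistics): Unit = {
  job.finished(0) = true                    // the 0th (and only) partition is done
  job.numFinished += 1
  job.listener.taskSucceeded(0, stats)      // notify the JobListener
  cleanupStateForJobAndIndependentStages(job)
  listenerBus.post(
    SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
}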
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getOrCreateParentStages","title":"Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies) of RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateParentStages(\n  rdd: RDD[_],\n  firstJobId: Int): List[Stage]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateParentStages finds all direct parent ShuffleDependencies of the input rdd and then finds ShuffleMapStages for each ShuffleDependency.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateParentStages is used when DAGScheduler is requested to create a ShuffleMapStage or a ResultStage.
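Conceptually (a hedged sketch, assuming the getShuffleDependencies helper covered elsewhere on this page):

private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] =
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList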

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#markStageAsFinished","title":"Marking Stage Finished
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markStageAsFinished(\n  stage: Stage,\n  errorMessage: Option[String] = None,\n  willRetry: Boolean = false): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markStageAsFinished...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markStageAsFinished is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getOrCreateShuffleMapStage","title":"Looking Up ShuffleMapStage for ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateShuffleMapStage(\n  shuffleDep: ShuffleDependency[_, _, _],\n  firstJobId: Int): ShuffleMapStage\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateShuffleMapStage finds a ShuffleMapStage by the shuffleId of the given ShuffleDependency in the shuffleIdToMapStage internal registry and returns it if available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If not found, getOrCreateShuffleMapStage finds all the missing ancestor shuffle dependencies and creates the missing ShuffleMapStage stages (including one for the input ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getOrCreateShuffleMapStage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to find or create missing direct parent ShuffleMapStages of an RDD, find missing parent ShuffleMapStages for a stage, handle a MapStageSubmitted event, and check out stage dependency on a stage
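A hedged sketch of the lookup-or-create logic (registry and helper names as described in this section):

private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage                                        // already registered
    case None =>
      // Create stages for all missing ancestor shuffle dependencies first...
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // ...and then for the given ShuffleDependency itself.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}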
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getMissingAncestorShuffleDependencies","title":"Missing ShuffleDependencies of RDD","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingAncestorShuffleDependencies(\n   rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]]\n

getMissingAncestorShuffleDependencies finds all the missing ShuffleDependencies for the given RDD (traversing its RDD lineage).

Note

A ShuffleDependency (of an RDD) is considered missing when not registered in the shuffleIdToMapStage internal registry.

Internally, getMissingAncestorShuffleDependencies finds the direct parent shuffle dependencies of the input RDD and collects the ones that are not registered in the shuffleIdToMapStage internal registry. It repeats the process for the RDDs of the parent shuffle dependencies.
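As a rough illustration of that traversal, the sketch below walks a simplified RDD-like lineage, collects shuffle dependencies whose shuffle id is not yet registered, and keeps walking from the RDDs of the shuffle dependencies it encounters. Node, NarrowDep, ShuffleDep and the `registered` set are hypothetical stand-ins (not Spark's RDD/Dependency classes), and the real code delegates the per-RDD lookup to getShuffleDependencies.

```scala
import scala.collection.mutable

object MissingAncestorShuffleDepsSketch {
  // Hypothetical, simplified lineage model
  sealed trait Dep { def rdd: Node }
  final case class NarrowDep(rdd: Node) extends Dep
  final case class ShuffleDep(shuffleId: Int, rdd: Node) extends Dep
  final case class Node(name: String, deps: Seq[Dep] = Nil)

  /** Shuffle dependencies in the lineage of `rdd` whose shuffle id has no
    * ShuffleMapStage yet (`registered` stands in for shuffleIdToMapStage keys). */
  def missingAncestorShuffleDependencies(rdd: Node, registered: Set[Int]): List[ShuffleDep] = {
    val missing = mutable.ListBuffer.empty[ShuffleDep]
    val toVisit = mutable.Stack(rdd)
    val visited = mutable.HashSet.empty[Node]
    while (toVisit.nonEmpty) {
      val current = toVisit.pop()
      if (visited.add(current)) {
        current.deps.foreach {
          case s: ShuffleDep =>
            if (!registered(s.shuffleId)) missing += s // not registered => missing
            toVisit.push(s.rdd)                        // repeat for the parent RDD
          case n: NarrowDep =>
            toVisit.push(n.rdd)                        // keep walking the lineage
        }
      }
    }
    missing.toList
  }
}
```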

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#getShuffleDependencies","title":"Finding Direct Parent Shuffle Dependencies of RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getShuffleDependencies(\n   rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]]\n

getShuffleDependencies finds the direct parent shuffle dependencies of the given RDD.

Internally, getShuffleDependencies takes the direct [shuffle dependencies of the input RDD](rdd/index.md#dependencies) and the direct shuffle dependencies of all the parent non-ShuffleDependencies in the RDD lineage.

getShuffleDependencies is used when DAGScheduler is requested to find or create missing direct parent ShuffleMapStages (for the ShuffleDependencies of an RDD) and to find all missing shuffle dependencies for a given RDD.
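The key point is that the traversal walks through narrow dependencies but stops at the first shuffle boundary on every path, so only the nearest shuffle parents are returned. Below is a minimal, self-contained Scala sketch of that idea; the Node/Dep model is a hypothetical simplification, not Spark's RDD/Dependency classes.

```scala
import scala.collection.mutable

object DirectShuffleDepsSketch {
  // Hypothetical, simplified lineage model
  sealed trait Dep { def rdd: Node }
  final case class NarrowDep(rdd: Node) extends Dep
  final case class ShuffleDep(shuffleId: Int, rdd: Node) extends Dep
  final case class Node(name: String, deps: Seq[Dep] = Nil)

  /** Direct parent shuffle dependencies: follow narrow dependencies upwards,
    * but do not go past the first shuffle boundary on any path. */
  def directShuffleDependencies(rdd: Node): Set[ShuffleDep] = {
    val parents = mutable.HashSet.empty[ShuffleDep]
    val toVisit = mutable.Stack(rdd)
    val visited = mutable.HashSet.empty[Node]
    while (toVisit.nonEmpty) {
      val current = toVisit.pop()
      if (visited.add(current)) {
        current.deps.foreach {
          case s: ShuffleDep => parents += s        // direct shuffle parent: stop here
          case n: NarrowDep  => toVisit.push(n.rdd) // narrow: keep searching upwards
        }
      }
    }
    parents.toSet
  }
}
```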

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#failJobAndIndependentStages","title":"Failing Job and Independent Single-Job Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            failJobAndIndependentStages(\n  job: ActiveJob,\n  failureReason: String,\n  exception: Option[Throwable] = None): Unit\n

failJobAndIndependentStages fails the input job and all the stages that are only used by the job.

Internally, failJobAndIndependentStages uses the jobIdToStageIds internal registry to look up the stages registered for the job.

If no stages could be found, you should see the following ERROR message in the logs:

```text
No stages registered for job [id]
```

Otherwise, for every stage, failJobAndIndependentStages finds the job ids the stage belongs to.

If no stages could be found or the job is not referenced by the stages, you should see the following ERROR message in the logs:

```text
Job [id] not registered for stage [id] even though that stage was registered for the job
```

Only when there is exactly one job registered for the stage and the stage is in RUNNING state (in the runningStages internal registry) does failJobAndIndependentStages request the [TaskScheduler to cancel the stage's tasks](TaskScheduler.md#contract) and mark the stage finished.

NOTE: failJobAndIndependentStages uses the jobIdToStageIds, stageIdToStage, and runningStages internal registries.
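The per-stage check can be pictured with the following minimal Scala sketch. All the registries, the ActiveJob case class, and the cancelTasks/markStageFinished helpers are hypothetical stand-ins for DAGScheduler's internal state and collaborators; listener notifications and the exception argument are omitted.

```scala
import scala.collection.mutable

object FailJobSketch {
  final case class ActiveJob(jobId: Int)

  // Hypothetical stand-ins for DAGScheduler's internal registries
  val jobIdToStageIds = mutable.HashMap.empty[Int, Set[Int]]
  val stageIdToJobIds = mutable.HashMap.empty[Int, Set[Int]] // the jobs each stage belongs to
  val runningStageIds = mutable.HashSet.empty[Int]

  def cancelTasks(stageId: Int): Unit = println(s"TaskScheduler.cancelTasks($stageId)")
  def markStageFinished(stageId: Int): Unit = runningStageIds -= stageId

  def failJobAndIndependentStages(job: ActiveJob, failureReason: String): Unit = {
    val stages = jobIdToStageIds.getOrElse(job.jobId, Set.empty)
    if (stages.isEmpty) println(s"ERROR No stages registered for job ${job.jobId}")
    stages.foreach { stageId =>
      val jobsForStage = stageIdToJobIds.getOrElse(stageId, Set.empty)
      if (!jobsForStage.contains(job.jobId)) {
        println(s"ERROR Job ${job.jobId} not registered for stage $stageId " +
          "even though that stage was registered for the job")
      } else if (jobsForStage == Set(job.jobId) && runningStageIds(stageId)) {
        // the stage is used only by this job and is running: cancel its tasks and finish it
        cancelTasks(stageId)
        markStageFinished(stageId)
      }
    }
    // finally the job itself is failed (bookkeeping omitted in this sketch)
  }
}
```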

failJobAndIndependentStages is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#abortStage","title":"Aborting Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            abortStage(\n  failedStage: Stage,\n  reason: String,\n  exception: Option[Throwable]): Unit\n

abortStage is an internal method that finds all the active jobs that depend on the failedStage stage and fails them.

Internally, abortStage looks the failedStage stage up in the internal stageIdToStage registry and exits if the stage was not registered earlier.

If it was, abortStage finds all the active jobs (in the internal activeJobs registry) whose final stage depends on the failedStage stage.

At this time, the completionTime property (of the failed stage's StageInfo) is assigned to the current time (millis).

All the active jobs that depend on the failed stage (as calculated above) and the stages that do not belong to other jobs (aka independent stages) are failed (with the failure reason being "Job aborted due to stage failure: [reason]" and the input exception).

If there are no jobs depending on the failed stage, you should see the following INFO message in the logs:

```text
Ignoring failure of [failedStage] because all jobs depending on it are done
```

abortStage is used when DAGScheduler is requested to handle a TaskSetFailed event, submit a stage, submit missing tasks of a stage, and handle a TaskCompletion event.
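The job-selection step above (filter the active jobs whose final stage depends on the failed stage, then fail them) can be sketched as follows. Stage, ActiveJob, and the dependsOn/failJob parameters are hypothetical stand-ins; in Spark the dependency test is stageDependsOn and the failure path is failJobAndIndependentStages.

```scala
object AbortStageSketch {
  // Hypothetical, simplified model: just enough to show the job-selection step
  final case class Stage(id: Int)
  final case class ActiveJob(jobId: Int, finalStage: Stage)

  def abortStage(
      failedStage: Stage,
      reason: String,
      activeJobs: Set[ActiveJob],
      dependsOn: (Stage, Stage) => Boolean,   // stands in for stageDependsOn
      failJob: (ActiveJob, String) => Unit): Unit = {
    val dependentJobs = activeJobs.filter(j => dependsOn(j.finalStage, failedStage))
    if (dependentJobs.isEmpty) {
      println(s"INFO Ignoring failure of $failedStage because all jobs depending on it are done")
    } else {
      dependentJobs.foreach { job =>
        failJob(job, s"Job aborted due to stage failure: $reason")
      }
    }
  }
}
```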

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#stageDependsOn","title":"Checking Out Stage Dependency on Given Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stageDependsOn(\n  stage: Stage,\n  target: Stage): Boolean\n

stageDependsOn compares two stages and returns whether the stage depends on the target stage (i.e. true) or not (i.e. false).

NOTE: A stage A depends on stage B if B is among the ancestors of A.

Internally, stageDependsOn walks through the graph of RDDs of the input stage. For every dependency of a visited RDD (using RDD.dependencies), stageDependsOn adds the RDD of a NarrowDependency to the stack of RDDs to visit, while for a ShuffleDependency it finds the ShuffleMapStage (for the dependency and the stage's first job id) and adds the map stage's RDD to the stack of RDDs to visit only when the map stage is not ready yet, i.e. not all of its partitions have shuffle outputs.

After all the RDDs of the input stage are visited, stageDependsOn checks if the target's RDD is among the RDDs of the stage, i.e. whether the stage depends on the target stage.
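A minimal, self-contained Scala sketch of that walk is shown below. The Node/Dep/Stage types and the mapStageReady flag are hypothetical simplifications of Spark's RDD, Dependency and ShuffleMapStage.isAvailable, used only to show the traversal and the final membership check.

```scala
import scala.collection.mutable

object StageDependsOnSketch {
  // Hypothetical, simplified model (not Spark's Stage/RDD/Dependency classes)
  sealed trait Dep { def rdd: Node }
  final case class NarrowDep(rdd: Node) extends Dep
  final case class ShuffleDep(rdd: Node, mapStageReady: Boolean) extends Dep
  final case class Node(name: String, deps: Seq[Dep] = Nil)
  final case class Stage(rdd: Node)

  def stageDependsOn(stage: Stage, target: Stage): Boolean = {
    if (stage == target) return true
    val visited = mutable.HashSet.empty[Node]
    val toVisit = mutable.Stack(stage.rdd)
    while (toVisit.nonEmpty) {
      val rdd = toVisit.pop()
      if (visited.add(rdd)) {
        rdd.deps.foreach {
          case n: NarrowDep => toVisit.push(n.rdd)
          // follow a shuffle dependency back only when its map stage is not ready yet
          case s: ShuffleDep if !s.mapStageReady => toVisit.push(s.rdd)
          case _ => () // map stage ready: no need to walk further back
        }
      }
    }
    visited.contains(target.rdd) // stage depends on target iff target's RDD was reached
  }
}
```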

stageDependsOn is used when DAGScheduler is requested to abort a stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#submitWaitingChildStages","title":"Submitting Waiting Child Stages for Execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitWaitingChildStages(\n  parent: Stage): Unit\n

submitWaitingChildStages submits for execution all waiting stages for which the input parent [Stage](Stage.md) is the direct parent.

NOTE: Waiting stages are the stages registered in the waitingStages internal registry.

When executed, you should see the following TRACE messages in the logs:

```text
Checking if any dependencies of [parent] are now runnable
running: [runningStages]
waiting: [waitingStages]
failed: [failedStages]
```

submitWaitingChildStages finds the child stages of the input parent stage, removes them from the waitingStages internal registry, and submits them one by one, sorted by their job ids.
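The filter-remove-sort-submit sequence can be sketched as below. Stage, waitingStages, and submitStage are hypothetical stand-ins for DAGScheduler's internals, reduced to the minimum needed to show the flow.

```scala
import scala.collection.mutable

object SubmitWaitingChildStagesSketch {
  // Hypothetical, simplified stand-ins
  final case class Stage(id: Int, firstJobId: Int, parents: Seq[Stage])

  val waitingStages = mutable.HashSet.empty[Stage]

  def submitStage(stage: Stage): Unit = println(s"submitting stage ${stage.id}")

  def submitWaitingChildStages(parent: Stage): Unit = {
    // child stages = waiting stages that list `parent` among their direct parents
    val childStages = waitingStages.filter(_.parents.contains(parent)).toSeq
    waitingStages --= childStages
    // submit them one by one, ordered by the job that created them
    childStages.sortBy(_.firstJobId).foreach(submitStage)
  }
}
```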

submitWaitingChildStages is used when DAGScheduler is requested to submit missing tasks for a stage and to handle a successful ShuffleMapTask completion.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#submitStage","title":"Submitting Stage (with Missing Parents) for Execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitStage(\n  stage: Stage): Unit\n

submitStage submits the input stage or its missing parents (if there are any stages that have not been computed yet and are required before the input stage can be submitted).

NOTE: submitStage is also used to [resubmit failed stages](DAGSchedulerEventProcessLoop.md#resubmitFailedStages).

submitStage recursively submits any missing parents of the stage.

Internally, submitStage first finds the earliest-created job id that needs the stage.

NOTE: A stage itself tracks the jobs (their ids) it belongs to (using the internal jobIds registry).

The following steps depend on whether there is a job or not.

If there are no jobs that require the stage, submitStage aborts it with the reason:

```text
No active job for stage [id]
```

If however there is a job for the stage, you should see the following DEBUG message in the logs:

```text
submitStage([stage])
```

submitStage checks the status of the stage and continues when it was not recorded in the waiting, running or failed internal registries. It simply exits otherwise.

With the stage ready for submission, submitStage calculates the list of missing parent stages of the stage (sorted by their job ids). You should see the following DEBUG message in the logs:

```text
missing: [missing]
```

When the stage has no missing parent stages, you should see the following INFO message in the logs:

```text
Submitting [stage] ([stage.rdd]), which has no missing parents
```

submitStage submits the stage (with the earliest-created job id) and finishes.

If however there are missing parent stages for the stage, submitStage submits all the parent stages, and the stage is recorded in the internal waitingStages registry.
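Putting the steps together, the recursion can be sketched as follows. Stage, the three registries, and the missingParentStages/submitMissingTasks/abortStage helpers are hypothetical simplifications (missingParentStages is a placeholder for the real lookup), included only to show the control flow: no job aborts the stage, no missing parents runs it, otherwise parents are submitted first and the stage is parked as waiting.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical, simplified stand-ins for the real Stage and registries
  final case class Stage(id: Int, jobIds: Set[Int])

  val waitingStages = mutable.HashSet.empty[Stage]
  val runningStages = mutable.HashSet.empty[Stage]
  val failedStages  = mutable.HashSet.empty[Stage]

  def activeJobForStage(stage: Stage): Option[Int] =
    if (stage.jobIds.isEmpty) None else Some(stage.jobIds.min) // earliest-created job id

  def missingParentStages(stage: Stage): List[Stage] = Nil // placeholder for the real lookup
  def submitMissingTasks(stage: Stage, jobId: Int): Unit =
    println(s"run stage ${stage.id} for job $jobId")
  def abortStage(stage: Stage, reason: String): Unit = println(s"abort: $reason")

  def submitStage(stage: Stage): Unit = activeJobForStage(stage) match {
    case None => abortStage(stage, s"No active job for stage ${stage.id}")
    case Some(jobId) =>
      val alreadyKnown = waitingStages(stage) || runningStages(stage) || failedStages(stage)
      if (!alreadyKnown) {
        val missing = missingParentStages(stage).sortBy(_.id)
        if (missing.isEmpty) {
          submitMissingTasks(stage, jobId) // no missing parents: run it
        } else {
          missing.foreach(submitStage)     // submit parents first ...
          waitingStages += stage           // ... and park this stage as waiting
        }
      }
  }
}
```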

submitStage is used recursively for missing parents of the given stage and when DAGScheduler is requested for the following:

• resubmitFailedStages (ResubmitFailedStages event)

• submitWaitingChildStages (CompletionEvent event)

• Handle JobSubmitted, MapStageSubmitted and TaskCompletion events

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#stage-attempts","title":"Stage Attempts

A single stage can be re-executed in multiple attempts due to fault recovery. The number of attempts is configured (FIXME).

If TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a [CompletionEvent with FetchFailed](DAGSchedulerEventProcessLoop.md#handleTaskCompletion-FetchFailed), or an ExecutorLost event. DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks.

Please note that tasks from the old attempts of a stage could still be running.

A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.

The latest StageInfo for the most recent attempt of a stage is accessible through latestInfo.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#preferred-locations","title":"Preferred Locations

DAGScheduler computes where to run each task in a stage based on the [preferred locations of its underlying RDDs](rdd/index.md#getPreferredLocations), or the location of cached or shuffle data.
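The lookup order (cached data first, then the RDD's own preference, then the first preference found through narrow parents) can be sketched as below. The Rdd trait, TaskLocation, Partition and the reuse of the same partition index across parents are hypothetical simplifications of the actual preferred-location computation, not Spark's classes.

```scala
object PreferredLocationsSketch {
  // Hypothetical, simplified model of the location lookup order
  final case class TaskLocation(host: String)
  final case class Partition(index: Int)

  trait Rdd {
    def narrowParents: Seq[Rdd]
    def cachedLocations(p: Partition): Seq[TaskLocation]    // e.g. block locations of a cached partition
    def preferredLocations(p: Partition): Seq[TaskLocation] // e.g. HDFS block hosts
  }

  /** Cached data wins, then the RDD's own preference, then the first preference
    * found through narrow parents (shuffle boundaries are not crossed). */
  def preferredLocs(rdd: Rdd, p: Partition): Seq[TaskLocation] = {
    val cached = rdd.cachedLocations(p)
    if (cached.nonEmpty) return cached
    val prefs = rdd.preferredLocations(p)
    if (prefs.nonEmpty) return prefs
    rdd.narrowParents.iterator
      .map(parent => preferredLocs(parent, p))
      .find(_.nonEmpty)
      .getOrElse(Seq.empty)
  }
}
```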

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#adaptive-query-planning","title":"Adaptive Query Planning / Adaptive Scheduling

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            See SPARK-9850 Adaptive execution in Spark for the design document. The work is currently in progress.

The DAGScheduler.submitMapStage method is used for adaptive query planning, to run map stages and look at statistics about their outputs before submitting downstream stages.
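
To make that concrete, here is a hedged sketch (not taken from Spark's sources) of running only the map side of a shuffle and inspecting its MapOutputStatistics before any downstream stage is submitted. Since SparkContext.submitMapStage is private[spark], the sketch assumes it is compiled inside the org.apache.spark package; all names and numbers are illustrative:

package org.apache.spark  // assumption: needed because submitMapStage is private[spark]

import scala.concurrent.Await
import scala.concurrent.duration.Duration

object SubmitMapStageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("submitMapStage-sketch"))

    // a shuffle dependency over some keyed data
    val pairs = sc.parallelize(1 to 1000, numSlices = 8).map(i => (i % 10, i))
    val dep = new ShuffleDependency[Int, Int, Int](pairs, new HashPartitioner(4))

    // run the map stage alone and wait for its output statistics
    val stats: MapOutputStatistics = Await.result(sc.submitMapStage(dep), Duration.Inf)

    // per-reducer output sizes could now drive how downstream stages are planned
    println(stats.bytesByPartitionId.mkString(", "))

    sc.stop()
  }
}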

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#scheduledexecutorservice-daemon-services","title":"ScheduledExecutorService daemon services

DAGScheduler uses the following ScheduledThreadPoolExecutor (with the policy of removing cancelled tasks from the work queue at the time of cancellation):

• dag-scheduler-message - a single-thread daemon pool (j.u.c.ScheduledThreadPoolExecutor with core pool size 1) used to post a DAGSchedulerEventProcessLoop.md#ResubmitFailedStages[ResubmitFailedStages] event when DAGSchedulerEventProcessLoop.md#handleTaskCompletion-FetchFailed[FetchFailed is reported].

It is created using the ThreadUtils.newDaemonSingleThreadScheduledExecutor method, which uses Guava's ThreadFactoryBuilder to create a daemon ThreadFactory.
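
For illustration only (a minimal sketch of the same pattern, not Spark's ThreadUtils code), such an executor can be built with Guava's ThreadFactoryBuilder; the messageScheduler name and the scheduled Runnable below are stand-ins:

import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit}
import com.google.common.util.concurrent.ThreadFactoryBuilder

// a single daemon thread named dag-scheduler-message
val threadFactory = new ThreadFactoryBuilder()
  .setDaemon(true)
  .setNameFormat("dag-scheduler-message")
  .build()

val messageScheduler = new ScheduledThreadPoolExecutor(1, threadFactory)
// remove cancelled tasks from the work queue at the time of cancellation
messageScheduler.setRemoveOnCancelPolicy(true)

// e.g. schedule a (hypothetical) resubmission of failed stages after a delay
messageScheduler.schedule(
  new Runnable { def run(): Unit = println("post ResubmitFailedStages") },
  200, TimeUnit.MILLISECONDS)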

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getMissingParentStages","title":"Finding Missing Parent ShuffleMapStages For Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingParentStages(\n  stage: Stage): List[Stage]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingParentStages finds missing parent ShuffleMapStages in the dependency graph of the input stage (using the breadth-first search algorithm).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internally, getMissingParentStages starts with the stage's RDD and walks up the tree of all parent RDDs to find uncached partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: A Stage tracks the associated RDD using Stage.md#rdd[rdd property].

NOTE: An uncached partition of an RDD is a partition that has Nil in the internal registry of partition locations per RDD (i.e. the RDD block of that partition is not available in any of the active storage:BlockManager.md[BlockManager]s on executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingParentStages traverses the rdd/index.md#dependencies[parent dependencies of the RDD] and acts according to their type, i.e. ShuffleDependency or NarrowDependency.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: ShuffleDependency and NarrowDependency are the main top-level Dependencies.

For each NarrowDependency, getMissingParentStages simply marks the parent RDD to be visited and moves on to the next dependency of the RDD (or to another unvisited parent RDD).

NOTE: NarrowDependency is an RDD dependency that allows for pipelined execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingParentStages focuses on ShuffleDependency dependencies.

NOTE: ShuffleDependency is an RDD dependency that represents a dependency on the output of a ShuffleMapStage, i.e. a shuffle map stage.

For each ShuffleDependency, getMissingParentStages finds the corresponding ShuffleMapStage. If that ShuffleMapStage is not available, it is added to the set of missing (map) stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: A ShuffleMapStage is available when all its partitions are computed, i.e. results are available (as blocks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME...IMAGE with ShuffleDependencies queried

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getMissingParentStages is used when DAGScheduler is requested to submit a stage and handle JobSubmitted and MapStageSubmitted events.
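
The traversal can be sketched as follows (a simplified illustration written as if it were a DAGScheduler method, not the actual implementation; getCacheLocs and getOrCreateShuffleMapStage are the internal helpers assumed to be in scope):

import scala.collection.mutable

def getMissingParentStagesSketch(stage: Stage): List[Stage] = {
  val missing = mutable.HashSet[Stage]()
  val visited = mutable.HashSet[RDD[_]]()
  // breadth-first walk over the RDD lineage of the stage
  val waitingForVisit = mutable.Queue[RDD[_]](stage.rdd)

  while (waitingForVisit.nonEmpty) {
    val rdd = waitingForVisit.dequeue()
    if (!visited(rdd)) {
      visited += rdd
      // only RDDs with uncached partitions need their parents computed
      if (getCacheLocs(rdd).contains(Nil)) {
        rdd.dependencies.foreach {
          case shuffleDep: ShuffleDependency[_, _, _] =>
            // a shuffle boundary: resolve (or create) the parent ShuffleMapStage
            val mapStage = getOrCreateShuffleMapStage(shuffleDep, stage.firstJobId)
            if (!mapStage.isAvailable) missing += mapStage
          case narrowDep: NarrowDependency[_] =>
            // narrow dependencies stay in the same stage: keep walking up
            waitingForVisit.enqueue(narrowDep.rdd)
        }
      }
    }
  }
  missing.toList
}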

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#submitMissingTasks","title":"Submitting Missing Tasks of Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks(\n  stage: Stage,\n  jobId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks([stage])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks requests the given Stage for the missing partitions (partitions that need to be computed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks adds the stage to the runningStages internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks notifies the OutputCommitCoordinator that stage execution started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks determines preferred locations (task locality preferences) of the missing partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks requests the stage for a new stage attempt.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks requests the LiveListenerBus to post a SparkListenerStageSubmitted event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks uses the closure Serializer to serialize the stage and create a so-called task binary. submitMissingTasks serializes the RDD (of the stage) and either the ShuffleDependency or the compute function based on the type of the stage (ShuffleMapStage or ResultStage, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks creates a broadcast variable for the task binary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

This shows how important broadcast variables are to Spark itself: broadcasting the serialized task binary is the most efficient way to distribute it among the executors of a Spark application.
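
A hedged sketch of that step (simplified relative to DAGScheduler's actual code; sc, the stage being submitted, and closureSerializer, a closure SerializerInstance from SparkEnv, are assumed to be in scope):

// serialize the stage's "recipe" once...
val taskBinaryBytes: Array[Byte] = stage match {
  case s: ShuffleMapStage =>
    // map tasks need the RDD and the shuffle dependency
    closureSerializer.serialize((s.rdd, s.shuffleDep): AnyRef).array()
  case s: ResultStage =>
    // result tasks need the RDD and the function to apply to each partition
    closureSerializer.serialize((s.rdd, s.func): AnyRef).array()
}

// ...and broadcast it so every task of the stage shares one copy per executor
val taskBinary = sc.broadcast(taskBinaryBytes)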

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks creates tasks for every missing partition:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleMapTasks for a ShuffleMapStage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResultTasks for a ResultStage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If there are tasks to submit for execution (i.e. there are missing partitions in the stage), submitMissingTasks prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Submitting [size] missing tasks from [stage] ([rdd]) (first 15 tasks are for partitions [partitionIds])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks requests the TaskScheduler to TaskScheduler.md#submitTasks[submit the tasks for execution] (as a new TaskSet.md[TaskSet]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            With no tasks to submit for execution, submitMissingTasks marks the stage as finished successfully.

submitMissingTasks prints out one of the following DEBUG messages, depending on the type of the stage:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Stage [stage] is actually done; (available: [isAvailable],available outputs: [numAvailableOutputs],partitions: [numPartitions])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            or

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Stage [stage] is actually done; (partitions: [numPartitions])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            for ShuffleMapStage and ResultStage, respectively.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, with no tasks to submit for execution, submitMissingTasks submits waiting child stages for execution and exits.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submitMissingTasks is used when DAGScheduler is requested to submit a stage for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getPreferredLocs","title":"Finding Preferred Locations for Missing Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPreferredLocs(\n  rdd: RDD[_],\n  partition: Int): Seq[TaskLocation]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPreferredLocs is simply an alias for the internal (recursive) getPreferredLocsInternal.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPreferredLocs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkContext is requested to getPreferredLocs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to submit the missing tasks of a stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getCacheLocs","title":"Finding BlockManagers (Executors) for Cached RDD Partitions (aka Block Location Discovery)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getCacheLocs(\n   rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getCacheLocs gives TaskLocations (block locations) for the partitions of the input rdd. getCacheLocs caches lookup results in cacheLocs internal registry.

NOTE: The size of the collection from getCacheLocs is exactly the number of partitions of the rdd.

NOTE: The size of every TaskLocation collection (i.e. every entry in the result of getCacheLocs) is exactly the number of storage:BlockManager.md[BlockManagers] on executors that hold the block of the corresponding partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internally, getCacheLocs finds rdd in the cacheLocs internal registry (of partition locations per RDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If rdd is not in cacheLocs internal registry, getCacheLocs branches per its storage:StorageLevel.md[storage level].

For the NONE storage level (i.e. no caching), the result is empty locations for every partition (i.e. no location preference).

For other (non-NONE) storage levels, getCacheLocs storage:BlockManagerMaster.md#getLocations-block-array[requests BlockManagerMaster for the block locations], which are then mapped to TaskLocations using the hostname and executor id of the BlockManager that owns the block (of a partition).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getCacheLocs records the computed block locations per partition (as TaskLocation) in cacheLocs internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: getCacheLocs requests locations from BlockManagerMaster using storage:BlockId.md#RDDBlockId[RDDBlockId] with the RDD id and the partition indices (which implies that the order of the partitions matters to request proper blocks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: DAGScheduler uses TaskLocation.md[TaskLocations] (with host and executor) while storage:BlockManagerMaster.md[BlockManagerMaster] uses storage:BlockManagerId.md[] (to track similar information, i.e. block locations).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getCacheLocs is used when DAGScheduler is requested to find missing parent MapStages and getPreferredLocsInternal.
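
Put roughly in code (a simplified sketch written as if inside DAGScheduler, not the actual method; blockManagerMaster stands for the driver-side BlockManagerMaster, Spark-internal types such as RDDBlockId, BlockId and TaskLocation are assumed to be in scope, and the memoization in cacheLocs is omitted):

def getCacheLocsSketch(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = {
  if (rdd.getStorageLevel == StorageLevel.NONE) {
    // not cached: no location preference for any partition
    IndexedSeq.fill(rdd.partitions.length)(Nil)
  } else {
    // one RDDBlockId per partition, in partition order
    val blockIds = rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i)).toArray[BlockId]
    blockManagerMaster.getLocations(blockIds).map { bmIds =>
      // one TaskLocation per BlockManager (executor) that holds the block
      bmIds.map(bmId => TaskLocation(bmId.host, bmId.executorId))
    }
  }
}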

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getPreferredLocsInternal","title":"Finding Placement Preferences for RDD Partition (recursively)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPreferredLocsInternal(\n   rdd: RDD[_],\n  partition: Int,\n  visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation]\n

getPreferredLocsInternal first looks up the TaskLocations for the partition of the rdd in the cacheLocs internal cache and, if found, returns them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Otherwise, if not found, getPreferredLocsInternal rdd/index.md#preferredLocations[requests rdd for the preferred locations of partition] and returns them.

NOTE: Preferred locations of the partitions of an RDD are also called placement preferences or locality preferences.

Otherwise, if not found, getPreferredLocsInternal walks the NarrowDependencies of the rdd and recursively finds the TaskLocations of their parent partitions, returning the first non-empty result.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If all the attempts fail to yield any non-empty result, getPreferredLocsInternal returns an empty collection of TaskLocation.md[TaskLocations].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getPreferredLocsInternal is used when DAGScheduler is requested for the preferred locations for missing partitions.
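
The recursion can be summarized like this (again a simplified sketch written as if inside DAGScheduler; getCacheLocs and Spark's internal types are assumed to be in scope):

import scala.collection.mutable.HashSet

def getPreferredLocsInternalSketch(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // each (rdd, partition) pair is considered at most once
  if (!visited.add((rdd, partition))) return Nil

  // 1. cached block locations win
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) return cached

  // 2. the RDD's own placement preferences (e.g. HDFS block locations)
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition))
  if (rddPrefs.nonEmpty) return rddPrefs.map(TaskLocation(_))

  // 3. recurse into narrow dependencies and return the first non-empty answer
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      n.getParents(partition).foreach { parentPartition =>
        val locs = getPreferredLocsInternalSketch(n.rdd, parentPartition, visited)
        if (locs.nonEmpty) return locs
      }
    case _ => // shuffle dependencies contribute no preference here
  }
  Nil
}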

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#stop","title":"Stopping DAGScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop stops the internal dag-scheduler-message thread pool, dag-scheduler-event-loop, and TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop is used when SparkContext is requested to stop.
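A minimal sketch of that shutdown sequence, using Stoppable stand-ins for the event loop and the TaskScheduler; the ordering follows the description above and is an assumption of the sketch, not a claim about the exact Spark sources.

```scala
import java.util.concurrent.ScheduledExecutorService

object DagSchedulerStopSketch {
  trait Stoppable { def stop(): Unit }

  final class DagSchedulerStub(
      messageScheduler: ScheduledExecutorService, // the "dag-scheduler-message" thread pool
      eventLoop: Stoppable,                       // the "dag-scheduler-event-loop"
      taskScheduler: Stoppable) {

    def stop(): Unit = {
      messageScheduler.shutdownNow() // stop the delayed-message thread pool
      eventLoop.stop()               // stop the event processing loop
      taskScheduler.stop()           // stop the TaskScheduler last
    }
  }
}
```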

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#killTaskAttempt","title":"Killing Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            killTaskAttempt(\n  taskId: Long,\n  interruptThread: Boolean,\n  reason: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            killTaskAttempt requests the TaskScheduler to kill a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            killTaskAttempt is used when SparkContext is requested to kill a task.
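Since this is reached from SparkContext.killTaskAttempt, a hedged usage sketch looks as follows; the task attempt id 42 is a made-up example, and in practice it would come from the Spark UI or a SparkListener.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KillTaskAttemptDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kill-task-demo")
    val sc = new SparkContext(conf)
    try {
      // 42L is a placeholder task attempt id taken, e.g., from the Spark UI.
      val accepted = sc.killTaskAttempt(
        42L,
        interruptThread = true,
        reason = "example: killing a straggler")
      println(s"Kill request delivered to the scheduler: $accepted")
    } finally {
      sc.stop()
    }
  }
}
```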

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#cleanUpAfterSchedulerStop","title":"cleanUpAfterSchedulerStop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanUpAfterSchedulerStop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cleanUpAfterSchedulerStop...FIXME

cleanUpAfterSchedulerStop is used when DAGSchedulerEventProcessLoop is requested to stop (onStop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#removeExecutorAndUnregisterOutputs","title":"removeExecutorAndUnregisterOutputs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeExecutorAndUnregisterOutputs(\n  execId: String,\n  fileLost: Boolean,\n  hostToUnregisterOutputs: Option[String],\n  maybeEpoch: Option[Long] = None): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeExecutorAndUnregisterOutputs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeExecutorAndUnregisterOutputs is used when DAGScheduler is requested to handle task completion (due to a fetch failure) and executor lost events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#markMapStageJobsAsFinished","title":"markMapStageJobsAsFinished
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobsAsFinished(\n  shuffleStage: ShuffleMapStage): Unit\n

markMapStageJobsAsFinished checks whether the given ShuffleMapStage is fully available while there are still map-stage jobs running.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If so, markMapStageJobsAsFinished requests the MapOutputTrackerMaster for the statistics (for the ShuffleDependency of the given ShuffleMapStage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For every map-stage job, markMapStageJobsAsFinished marks the map-stage job as finished (with the statistics).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markMapStageJobsAsFinished is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to submit missing tasks (of a ShuffleMapStage that has just been computed) and processShuffleMapStageCompletion
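A self-contained sketch of the control flow described above; every type here is a simplified stand-in (suffixed Stub) rather than Spark's real scheduler class.

```scala
object MarkMapStageJobsSketch {
  // Simplified stand-ins, not Spark's real classes.
  final case class MapOutputStatistics(bytesByPartitionId: Array[Long])
  final case class ShuffleDependencyStub(shuffleId: Int)
  final case class ActiveJobStub(jobId: Int)
  final case class ShuffleMapStageStub(
      shuffleDep: ShuffleDependencyStub,
      isAvailable: Boolean,             // are all map outputs registered?
      mapStageJobs: Seq[ActiveJobStub]) // map-stage jobs submitted via submitMapStage

  // Stand-in for asking MapOutputTrackerMaster for the shuffle statistics.
  def getStatistics(dep: ShuffleDependencyStub): MapOutputStatistics =
    MapOutputStatistics(Array.empty)

  def markMapStageJobAsFinished(job: ActiveJobStub, stats: MapOutputStatistics): Unit =
    println(s"map-stage job ${job.jobId} finished")

  def markMapStageJobsAsFinished(stage: ShuffleMapStageStub): Unit =
    // Only act when the stage is fully available and map-stage jobs are still running.
    if (stage.isAvailable && stage.mapStageJobs.nonEmpty) {
      val stats = getStatistics(stage.shuffleDep)
      stage.mapStageJobs.foreach(markMapStageJobAsFinished(_, stats))
    }
}
```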
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#processShuffleMapStageCompletion","title":"processShuffleMapStageCompletion
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            processShuffleMapStageCompletion(\n  shuffleStage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            processShuffleMapStageCompletion...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            processShuffleMapStageCompletion is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handleTaskCompletion and handleShuffleMergeFinalized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#handleShuffleMergeFinalized","title":"handleShuffleMergeFinalized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleShuffleMergeFinalized(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleShuffleMergeFinalized...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleShuffleMergeFinalized is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGSchedulerEventProcessLoop is requested to handle a ShuffleMergeFinalized event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#scheduleShuffleMergeFinalize","title":"scheduleShuffleMergeFinalize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scheduleShuffleMergeFinalize(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scheduleShuffleMergeFinalize...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scheduleShuffleMergeFinalize is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handle a task completion
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#finalizeShuffleMerge","title":"finalizeShuffleMerge","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            finalizeShuffleMerge(\n  stage: ShuffleMapStage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            finalizeShuffleMerge...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#updateJobIdStageIdMaps","title":"updateJobIdStageIdMaps
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateJobIdStageIdMaps(\n  jobId: Int,\n  stage: Stage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateJobIdStageIdMaps...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateJobIdStageIdMaps is used when DAGScheduler is requested to create ShuffleMapStage and ResultStage stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#executorHeartbeatReceived","title":"executorHeartbeatReceived
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            executorHeartbeatReceived(\n  execId: String,\n  // (taskId, stageId, stageAttemptId, accumUpdates)\n  accumUpdates: Array[(Long, Int, Int, Seq[AccumulableInfo])],\n  blockManagerId: BlockManagerId,\n  // (stageId, stageAttemptId) -> metrics\n  executorUpdates: mutable.Map[(Int, Int), ExecutorMetrics]): Boolean\n

executorHeartbeatReceived posts a SparkListenerExecutorMetricsUpdate (to listenerBus) and informs the BlockManagerMaster that the blockManagerId block manager is alive (by posting BlockManagerHeartbeat).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            executorHeartbeatReceived is used when TaskSchedulerImpl is requested to handle an executor heartbeat.
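A minimal sketch of those two notifications, with stand-in types for the listener bus, the BlockManagerMaster and the messages; only the fields needed for the sketch are modeled.

```scala
object ExecutorHeartbeatSketch {
  // Simplified stand-ins mirroring the names used in the description above.
  final case class BlockManagerId(executorId: String, host: String, port: Int)
  final case class SparkListenerExecutorMetricsUpdate(execId: String)
  final case class BlockManagerHeartbeat(blockManagerId: BlockManagerId)

  trait ListenerBusStub { def post(event: Any): Unit }
  trait BlockManagerMasterStub { def askIsAlive(msg: BlockManagerHeartbeat): Boolean }

  def executorHeartbeatReceived(
      execId: String,
      blockManagerId: BlockManagerId,
      listenerBus: ListenerBusStub,
      blockManagerMaster: BlockManagerMasterStub): Boolean = {
    // 1. Let listeners know about the latest executor metrics/accumulator updates.
    listenerBus.post(SparkListenerExecutorMetricsUpdate(execId))
    // 2. Tell the BlockManagerMaster that the executor's block manager is still alive.
    blockManagerMaster.askIsAlive(BlockManagerHeartbeat(blockManagerId))
  }
}
```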

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#event-handlers","title":"Event Handlers","text":""},{"location":"scheduler/DAGScheduler/#doCancelAllJobs","title":"AllJobsCancelled Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doCancelAllJobs(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doCancelAllJobs...FIXME

doCancelAllJobs is used when DAGSchedulerEventProcessLoop is requested to handle an AllJobsCancelled event or fails with an error (onError).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleBeginEvent","title":"BeginEvent Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleBeginEvent(\n  task: Task[_],\n  taskInfo: TaskInfo): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleBeginEvent...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleBeginEvent is used when DAGSchedulerEventProcessLoop is requested to handle a BeginEvent event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion","title":"Handling Task Completion Event","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion(\n  event: CompletionEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion handles a CompletionEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion notifies the OutputCommitCoordinator that a task completed.

handleTaskCompletion finds the stage in the stageIdToStage registry. If not found, handleTaskCompletion posts a task-end event (postTaskEnd) and quits.

handleTaskCompletion updates accumulators (updateAccumulators).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion announces task completion application-wide.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion branches off per TaskEndReason (as event.reason).
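The dispatch can be pictured as a pattern match over a reduced stand-in of the TaskEndReason hierarchy; only the cases mentioned on this page are modeled.

```scala
object TaskEndReasonDispatchSketch {
  // Reduced stand-in for Spark's TaskEndReason hierarchy.
  sealed trait TaskEndReason
  case object Success extends TaskEndReason
  case object Resubmitted extends TaskEndReason
  final case class OtherReason(description: String) extends TaskEndReason

  def handleTaskCompletion(reason: TaskEndReason): Unit = reason match {
    case Success =>
      // Act according to the type of the completed task (ShuffleMapTask vs ResultTask).
      println("handle successful completion")
    case Resubmitted =>
      println("handle a resubmitted task")
    case OtherReason(desc) =>
      println(s"handle other end reason: $desc")
  }
}
```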

| TaskEndReason | Description |
|---------------|-------------|
| Success | Acts according to the type of the task that completed, i.e. ShuffleMapTask and ResultTask |
| Resubmitted | |
| others | |

"},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success","title":"Handling Successful Task Completion","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When a task has finished successfully (i.e. Success end reason), handleTaskCompletion marks the partition as no longer pending (i.e. the partition the task worked on is removed from pendingPartitions of the stage).

NOTE: A Stage tracks its own pending partitions in its pendingPartitions property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion branches off given the type of the task that completed, i.e. ShuffleMapTask and ResultTask.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success-ResultTask","title":"Handling Successful ResultTask Completion","text":"

For a ResultTask, the stage is assumed to be a ResultStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion finds the ActiveJob associated with the ResultStage.

NOTE: A ResultStage tracks its optional ActiveJob in its activeJob property. There can be at most one active job for a ResultStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If there is no job for the ResultStage, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Ignoring result from [task] because its job has finished\n

Otherwise, when the ResultStage has an ActiveJob, handleTaskCompletion checks the status of the partition output for the partition the ResultTask ran for.

NOTE: ActiveJob tracks task completions in its finished property, with a flag for every partition of the stage. When the flag for a partition is enabled (i.e. true), the partition is assumed to have been computed already, and any further results from a ResultTask for it are simply ignored.

CAUTION: FIXME Describe why a partition could have more than one ResultTask running.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion ignores the CompletionEvent when the partition has already been marked as completed for the stage and simply exits.

handleTaskCompletion updates accumulators.

The partition for the ActiveJob (of the ResultStage) is marked as computed and the number of computed partitions is increased.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: ActiveJob tracks what partitions have already been computed and their number.

If the ActiveJob has finished (i.e. the number of computed partitions is exactly the number of partitions in the stage), handleTaskCompletion does the following (in order):

1. Marks the ResultStage as finished (markStageAsFinished).
2. Cleans up after the ActiveJob and independent stages (cleanupStateForJobAndIndependentStages).
3. Announces the job completion application-wide (by posting a SparkListenerJobEnd to the LiveListenerBus).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, handleTaskCompletion notifies JobListener of the ActiveJob that the task succeeded.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: A task succeeded notification holds the output index and the result.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When the notification throws an exception (because it runs user code), handleTaskCompletion notifies JobListener about the failure (wrapping it inside a SparkDriverExecutionException exception).
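
The ResultTask success path above can be summarized with a minimal, self-contained sketch. All names below (ActiveJobSketch, onResultTaskSuccess, the simplified JobListener) are illustrative stand-ins for DAGScheduler internals, not Spark's public API:

```scala
import scala.collection.mutable

// Stand-ins for the real Spark types (illustrative only).
class SparkDriverExecutionException(cause: Throwable) extends Exception(cause)

trait JobListener {
  def taskSucceeded(index: Int, result: Any): Unit
  def jobFailed(exception: Exception): Unit
}

class ActiveJobSketch(val numPartitions: Int, val listener: JobListener) {
  val finished: mutable.ArrayBuffer[Boolean] =
    mutable.ArrayBuffer.fill(numPartitions)(false)
  var numFinished: Int = 0
}

object ResultTaskCompletion {
  def onResultTaskSuccess(job: ActiveJobSketch, outputIndex: Int, result: Any): Unit = {
    if (!job.finished(outputIndex)) {
      // Mark the partition as computed and increase the job's counter.
      job.finished(outputIndex) = true
      job.numFinished += 1
      if (job.numFinished == job.numPartitions) {
        // Whole job done: mark the ResultStage as finished, clean up job state
        // and post SparkListenerJobEnd (elided in this sketch).
      }
      // The notification runs user code, so a failure is wrapped in
      // SparkDriverExecutionException and reported back as a job failure.
      try {
        job.listener.taskSucceeded(outputIndex, result)
      } catch {
        case e: Exception =>
          job.listener.jobFailed(new SparkDriverExecutionException(e))
      }
    }
  }
}
```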

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Success-ShuffleMapTask","title":"Handling Successful ShuffleMapTask Completion","text":"

For scheduler:ShuffleMapTask.md[ShuffleMapTask], the stage is assumed to be a scheduler:ShuffleMapStage.md[ShuffleMapStage].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion scheduler:DAGScheduler.md#updateAccumulators[updates accumulators].

The task's result is assumed to be a scheduler:MapStatus.md[MapStatus] that knows the executor on which the task finished.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapTask finished on [execId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the executor is registered in scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry] and the epoch of the completed task is not greater than that of the executor (as in failedEpoch registry), you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Ignoring possibly bogus [task] completion from executor [executorId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Otherwise, handleTaskCompletion scheduler:ShuffleMapStage.md#addOutputLoc[registers the MapStatus result for the partition with the stage] (of the completed task).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion does more processing only if the ShuffleMapStage is registered as still running (in scheduler:DAGScheduler.md#runningStages[runningStages internal registry]) and the scheduler:Stage.md#pendingPartitions[ShuffleMapStage stage has no pending partitions to compute].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The ShuffleMapStage is marked as finished.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You should see the following INFO messages in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            looking for newly runnable stages\nrunning: [runningStages]\nwaiting: [waitingStages]\nfailed: [failedStages]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion scheduler:MapOutputTrackerMaster.md#registerMapOutputs[registers the shuffle map outputs of the ShuffleDependency with MapOutputTrackerMaster] (with the epoch incremented) and scheduler:DAGScheduler.md#clearCacheLocs[clears internal cache of the stage's RDD block locations].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] is given when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the scheduler:ShuffleMapStage.md#isAvailable[ShuffleMapStage stage is ready], all scheduler:ShuffleMapStage.md#mapStageJobs[active jobs of the stage] (aka map-stage jobs) are scheduler:DAGScheduler.md#markMapStageJobAsFinished[marked as finished] (with scheduler:MapOutputTrackerMaster.md#getStatistics[MapOutputStatistics from MapOutputTrackerMaster for the ShuffleDependency]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: A ShuffleMapStage stage is ready (aka available) when all partitions have shuffle outputs, i.e. when their tasks have completed.
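
As an illustration only (not the actual ShuffleMapStage implementation), readiness can be modelled as a simple counter of partitions that already have a registered shuffle output:

```scala
// Hypothetical model of ShuffleMapStage availability: the stage is "ready"
// once every partition has a registered shuffle output.
class ShuffleMapStageSketch(val numPartitions: Int) {
  private var numAvailableOutputs = 0

  // Called when a MapStatus is registered for a partition of this stage.
  def addOutputLoc(): Unit = numAvailableOutputs += 1

  def isAvailable: Boolean = numAvailableOutputs == numPartitions
}
```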

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Eventually, handleTaskCompletion scheduler:DAGScheduler.md#submitWaitingChildStages[submits waiting child stages (of the ready ShuffleMapStage)].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If however the ShuffleMapStage is not ready, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resubmitting [shuffleStage] ([shuffleStage.name]) because some of its tasks had failed: [missingPartitions]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, handleTaskCompletion scheduler:DAGScheduler.md#submitStage[submits the ShuffleMapStage for execution].
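
Putting the ShuffleMapTask path together, here is a condensed, hedged sketch of the flow; parameter names are illustrative stand-ins and the callbacks represent the DAGScheduler steps linked above:

```scala
import scala.collection.mutable

object ShuffleMapTaskCompletion {
  def onShuffleMapTaskSuccess(
      partitionId: Int,
      stageIsRunning: Boolean,              // still in runningStages?
      pendingPartitions: mutable.Set[Int],  // partitions still to compute
      registerMapOutput: Int => Unit,       // register the MapStatus for a partition
      stageIsAvailable: () => Boolean,      // do all partitions have shuffle outputs?
      submitWaitingChildStages: () => Unit,
      resubmitStage: () => Unit): Unit = {
    // Register the map output of the finished partition with the stage.
    registerMapOutput(partitionId)
    pendingPartitions -= partitionId

    // Proceed only when the stage is still running and nothing is pending.
    if (stageIsRunning && pendingPartitions.isEmpty) {
      // markStageAsFinished, MapOutputTrackerMaster.registerMapOutputs (with the
      // epoch incremented) and clearCacheLocs happen here in the real scheduler.
      if (stageIsAvailable()) submitWaitingChildStages()
      else resubmitStage() // some tasks failed: compute the missing partitions again
    }
  }
}
```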

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-Resubmitted","title":"TaskEndReason: Resubmitted","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For Resubmitted case, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resubmitted [task], so marking it as still running\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The task (by task.partitionId) is added to the collection of pending partitions of the stage (using stage.pendingPartitions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TIP: A stage knows how many partitions are yet to be calculated. A task knows about the partition id for which it was launched.
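
A minimal sketch of what the Resubmitted case boils down to (names are illustrative only):

```scala
import scala.collection.mutable

object ResubmittedTask {
  // The task will be run again, so its partition goes back to the stage's
  // pending partitions and is counted as still to be computed.
  def onTaskResubmitted(
      taskPartitionId: Int,
      stagePendingPartitions: mutable.Set[Int]): Unit = {
    stagePendingPartitions += taskPartitionId
  }
}
```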

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskCompletion-FetchFailed","title":"Task Failed with FetchFailed Exception","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FetchFailed(\n  bmAddress: BlockManagerId,\n  shuffleId: Int,\n  mapId: Int,\n  reduceId: Int,\n  message: String)\nextends TaskFailedReason\n

When FetchFailed happens, stageIdToStage is used to look up the failed stage (using task.stageId; the task is available in event in handleTaskCompletion(event: CompletionEvent)), and shuffleToMapStage is used to look up the map stage (using shuffleId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If failedStage.latestInfo.attemptId != task.stageAttemptId, you should see the following INFO in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Ignoring fetch failure from [task] as it's from [failedStage] attempt [task.stageAttemptId] and there is a more recent attempt for that stage (attempt ID [failedStage.latestInfo.attemptId]) running\n

NOTE: failedStage.latestInfo.attemptId != task.stageAttemptId means that the task belongs to an earlier attempt of the stage than the one currently running, so the fetch failure is stale.

In that case the event is ignored and the case finishes. Otherwise, the case continues.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the failed stage is in runningStages, the following INFO message shows in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Marking [failedStage] ([failedStage.name]) as failed due to a fetch failure from [mapStage] ([mapStage.name])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            markStageAsFinished(failedStage, Some(failureMessage)) is called.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME What does markStageAsFinished do?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the failed stage is not in runningStages, the following DEBUG message shows in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Received fetch failure from [task], but its from [failedStage] which is no longer running\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When disallowStageRetryForTest is set, abortStage(failedStage, \"Fetch failure will not retry stage due to testing config\", None) is called.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Describe disallowStageRetryForTest and abortStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the scheduler:Stage.md#failedOnFetchAndShouldAbort[number of fetch failed attempts for the stage exceeds the allowed number], the scheduler:DAGScheduler.md#abortStage[failed stage is aborted] with the reason:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            [failedStage] ([name]) has failed the maximum allowable number of times: 4. Most recent failure reason: [failureMessage]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If there are no failed stages reported (scheduler:DAGScheduler.md#failedStages[DAGScheduler.failedStages] is empty), the following INFO shows in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resubmitting [mapStage] ([mapStage.name]) and [failedStage] ([failedStage.name]) due to fetch failure\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            And the following code is executed:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            messageScheduler.schedule(\n  new Runnable {\n    override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)\n  }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)\n

The code above posts a ResubmitFailedStages message to the event loop after a DAGScheduler.RESUBMIT_TIMEOUT delay, so that fetch failures that arrive close together are handled by a single resubmission rather than one at a time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For all the cases, the failed stage and map stages are both added to the internal scheduler:DAGScheduler.md#failedStages[registry of failed stages].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If mapId (in the FetchFailed object for the case) is provided, the map stage output is cleaned up (as it is broken) using mapStage.removeOutputLoc(mapId, bmAddress) and scheduler:MapOutputTracker.md#unregisterMapOutput[MapOutputTrackerMaster.unregisterMapOutput(shuffleId, mapId, bmAddress)] methods.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME What does mapStage.removeOutputLoc do?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If BlockManagerId (as bmAddress in the FetchFailed object) is defined, handleTaskCompletion notifies DAGScheduler that an executor was lost (with filesLost enabled and maybeEpoch from the scheduler:Task.md#epoch[Task] that completed).
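
The whole FetchFailed branch can be condensed into the following hedged sketch; all parameters and callbacks are illustrative stand-ins for the checks and actions described above, not real DAGScheduler members:

```scala
object FetchFailedHandling {
  def onFetchFailed(
      failureFromLatestAttempt: Boolean,  // task.stageAttemptId matches the latest attempt?
      failedStageIsRunning: Boolean,
      disallowStageRetryForTest: Boolean,
      tooManyFetchFailures: Boolean,      // failedOnFetchAndShouldAbort
      noFailedStagesYet: Boolean,         // DAGScheduler.failedStages is empty
      markFailedStageAsFinished: () => Unit,
      abortStage: String => Unit,
      scheduleResubmitFailedStages: () => Unit,
      registerFailedStages: () => Unit,
      unregisterBrokenMapOutput: () => Unit,
      handleExecutorLost: () => Unit): Unit = {
    // Stale failure from an older stage attempt: ignore it.
    if (!failureFromLatestAttempt) return

    if (failedStageIsRunning) markFailedStageAsFinished()

    if (disallowStageRetryForTest) {
      abortStage("Fetch failure will not retry stage due to testing config")
    } else if (tooManyFetchFailures) {
      abortStage("has failed the maximum allowable number of times")
    } else if (noFailedStagesYet) {
      // Post ResubmitFailedStages after RESUBMIT_TIMEOUT so that fetch failures
      // arriving close together are resubmitted in one batch.
      scheduleResubmitFailedStages()
    }

    registerFailedStages()       // failed stage and its map stage go to failedStages
    unregisterBrokenMapOutput()  // if mapId is known, drop the broken map output
    handleExecutorLost()         // if bmAddress is known, treat the executor as lost
  }
}
```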

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskCompletion is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGSchedulerEventProcessLoop is requested to handle a CompletionEvent event.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleExecutorAdded","title":"ExecutorAdded Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorAdded(\n  execId: String,\n  host: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorAdded...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorAdded is used when DAGSchedulerEventProcessLoop is requested to handle an ExecutorAdded event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleExecutorLost","title":"ExecutorLost Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorLost(\n  execId: String,\n  workerLost: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorLost checks whether the input optional maybeEpoch is defined and if not requests the scheduler:MapOutputTracker.md#getEpoch[current epoch from MapOutputTrackerMaster].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: MapOutputTrackerMaster is passed in (as mapOutputTracker) when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME When is maybeEpoch passed in?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            .DAGScheduler.handleExecutorLost image::dagscheduler-handleExecutorLost.png[align=\"center\"]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Recurring ExecutorLost events lead to the following repeating DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            DEBUG Additional executor lost message for [execId] (epoch [currentEpoch])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: handleExecutorLost handler uses DAGScheduler's failedEpoch and FIXME internal registries.

Otherwise, when the executor execId is not in the scheduler:DAGScheduler.md#failedEpoch[list of lost executors] or the executor failure's epoch is smaller than the input maybeEpoch, the executor's lost event is recorded in the scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Describe the case above in simpler non-technical words. Perhaps change the order, too.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            INFO Executor lost: [execId] (epoch [epoch])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            storage:BlockManagerMaster.md#removeExecutor[BlockManagerMaster is requested to remove the lost executor execId].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Review what's filesLost.

handleExecutorLost continues only when the shuffle files are considered lost, that is when the ExecutorLost event was due to a map output fetch failure (the input filesLost is true) or when the external shuffle service is not used; otherwise it exits.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In such a case, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Shuffle files lost for executor: [execId] (epoch [epoch])\n

handleExecutorLost walks over all the ShuffleMapStages in the DAGScheduler's shuffleToMapStage internal registry and does the following (in order):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. ShuffleMapStage.removeOutputsOnExecutor(execId) is called
2. MapOutputTrackerMaster.registerMapOutputs(shuffleId, stage.outputLocInMapOutputTrackerFormat(), changeEpoch = true) is called.

In case the DAGScheduler's shuffleToMapStage internal registry has no shuffles registered, the MapOutputTrackerMaster is requested to increment the epoch.

Ultimately, the DAGScheduler clears the internal cache of RDD partition locations.
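
The shuffle-output cleanup above can be sketched in plain Scala. This is a simplified, self-contained model for illustration only: ShuffleStage, Tracker and handleShuffleFilesLost are stand-ins, not Spark's actual classes.

object ExecutorLostSketch {
  // Stand-in for a ShuffleMapStage: tracks map outputs per executor.
  final case class ShuffleStage(shuffleId: Int, var outputsByExec: Map[String, Int]) {
    def removeOutputsOnExecutor(execId: String): Unit =
      outputsByExec -= execId
  }

  // Stand-in for MapOutputTrackerMaster: only the epoch bookkeeping is modelled.
  final class Tracker {
    var epoch: Int = 0
    def registerMapOutputs(shuffleId: Int, changeEpoch: Boolean): Unit =
      if (changeEpoch) epoch += 1
    def incrementEpoch(): Unit = epoch += 1
  }

  def handleShuffleFilesLost(
      execId: String,
      shuffleToMapStage: Map[Int, ShuffleStage],
      tracker: Tracker): Unit = {
    if (shuffleToMapStage.isEmpty) {
      tracker.incrementEpoch()                      // no shuffles registered
    } else {
      shuffleToMapStage.values.foreach { stage =>
        stage.removeOutputsOnExecutor(execId)       // drop the lost executor's outputs
        tracker.registerMapOutputs(stage.shuffleId, changeEpoch = true)
      }
    }
    // ...followed by clearing the cached RDD partition locations
  }
}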

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleExecutorLost is used when DAGSchedulerEventProcessLoop is requested to handle an ExecutorLost event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleGetTaskResult","title":"GettingResultEvent Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleGetTaskResult(\n  taskInfo: TaskInfo): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleGetTaskResult...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleGetTaskResult is used when DAGSchedulerEventProcessLoop is requested to handle a GettingResultEvent event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleJobCancellation","title":"JobCancelled Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobCancellation(\n  jobId: Int,\n  reason: Option[String]): Unit\n

handleJobCancellation looks up the active job for the input job ID (in the jobIdToActiveJob internal registry) and fails it, together with all associated independent stages, with the following failure reason:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Job [jobId] cancelled [reason]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When the input job ID is not found, handleJobCancellation prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Trying to cancel unregistered job [jobId]\n
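
A minimal sketch of this lookup-and-fail logic follows; ActiveJob and failJobAndIndependentStages are simplified stand-ins, not Spark's real types.

final case class ActiveJob(jobId: Int)

class JobCancellationSketch(jobIdToActiveJob: Map[Int, ActiveJob]) {
  // Stand-in for the scheduler's fail-job-and-independent-stages step.
  private def failJobAndIndependentStages(job: ActiveJob, error: Exception): Unit =
    println(s"Failing job ${job.jobId}: ${error.getMessage}")

  def handleJobCancellation(jobId: Int, reason: Option[String]): Unit =
    jobIdToActiveJob.get(jobId) match {
      case Some(job) =>
        val message = s"Job $jobId cancelled ${reason.getOrElse("")}"
        failJobAndIndependentStages(job, new Exception(message))
      case None =>
        println(s"Trying to cancel unregistered job $jobId")  // DEBUG-level in the real scheduler
    }
}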

handleJobCancellation is used when DAGScheduler is requested to handle a JobCancelled event, doCancelAllJobs, handleJobGroupCancelled, and handleStageCancellation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleJobGroupCancelled","title":"JobGroupCancelled Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobGroupCancelled(\n  groupId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobGroupCancelled finds active jobs in a group and cancels them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internally, handleJobGroupCancelled computes all the active jobs (registered in the internal collection of active jobs) that have spark.jobGroup.id scheduling property set to groupId.

handleJobGroupCancelled then cancels every active job in the group one by one, with the following cancellation reason:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            part of cancelled job group [groupId]\n
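
The group filter and per-job cancellation could look roughly like the sketch below; ActiveJob and the cancelJob callback are simplified stand-ins, and only the spark.jobGroup.id property name comes from the description above.

import java.util.Properties

object JobGroupCancelledSketch {
  final case class ActiveJob(jobId: Int, properties: Properties)

  def handleJobGroupCancelled(
      groupId: String,
      activeJobs: Set[ActiveJob],
      cancelJob: (Int, Option[String]) => Unit): Unit = {
    // All active jobs whose spark.jobGroup.id scheduling property matches groupId
    val jobsInGroup = activeJobs.filter { job =>
      Option(job.properties).exists(_.getProperty("spark.jobGroup.id") == groupId)
    }
    // Cancel them one by one with the group-cancellation reason
    jobsInGroup.map(_.jobId).foreach { jobId =>
      cancelJob(jobId, Some(s"part of cancelled job group $groupId"))
    }
  }
}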

handleJobGroupCancelled is used when DAGScheduler is requested to handle a JobGroupCancelled event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleJobSubmitted","title":"Handling JobSubmitted Event","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted(\n  jobId: Int,\n  finalRDD: RDD[_],\n  func: (TaskContext, Iterator[_]) => _,\n  partitions: Array[Int],\n  callSite: CallSite,\n  listener: JobListener,\n  properties: Properties): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted creates a ResultStage (finalStage) for the given RDD, func, partitions, jobId and callSite.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BarrierJobSlotsNumberCheckFailed Exception

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Creating a ResultStage may fail with a BarrierJobSlotsNumberCheckFailed exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted removes the given jobId from the barrierJobIdToNumTasksCheckFailures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted creates an ActiveJob for the ResultStage (with the given jobId, the callSite, the JobListener and the properties).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted clears the internal cache of RDD partition locations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FIXME Why is this clearing here so important?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted prints out the following INFO messages to the logs (with missingParentStages):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Got job [id] ([callSite]) with [number] output partitions\nFinal stage: [finalStage] ([name])\nParents of final stage: [parents]\nMissing parents: [missingParentStages]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted registers the new ActiveJob in jobIdToActiveJob and activeJobs internal registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted requests the ResultStage to associate itself with the ActiveJob.

handleJobSubmitted uses the jobIdToStageIds internal registry to find all registered stages for the given jobId. handleJobSubmitted then uses the stageIdToStage internal registry to look up those stages and collect their latest StageInfo.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, handleJobSubmitted posts a SparkListenerJobStart message to the LiveListenerBus and submits the ResultStage.
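
The overall flow can be condensed into the following self-contained sketch; ResultStage, ActiveJob, the registries, and the helper methods are simplified stand-ins, not Spark's actual API.

import scala.collection.mutable

final case class ResultStage(id: Int) { var activeJob: Option[ActiveJob] = None }
final case class ActiveJob(jobId: Int, finalStage: ResultStage)

class JobSubmittedSketch {
  private val jobIdToActiveJob = mutable.Map.empty[Int, ActiveJob]
  private val activeJobs = mutable.Set.empty[ActiveJob]

  private def createResultStage(jobId: Int): ResultStage = ResultStage(jobId * 100)
  private def clearCacheLocs(): Unit = ()                  // drop cached RDD partition locations
  private def postJobStart(job: ActiveJob): Unit = ()      // SparkListenerJobStart to the listener bus
  private def submitStage(stage: ResultStage): Unit = ()

  def handleJobSubmitted(jobId: Int): Unit = {
    val finalStage = createResultStage(jobId)   // 1. create the ResultStage
    val job = ActiveJob(jobId, finalStage)      // 2. create the ActiveJob
    clearCacheLocs()                            // 3. clear the partition-location cache
    jobIdToActiveJob(jobId) = job               // 4. register the job
    activeJobs += job
    finalStage.activeJob = Some(job)            // 5. associate the ResultStage with the job
    postJobStart(job)                           // 6. announce the job start
    submitStage(finalStage)                     // 7. submit the ResultStage
  }
}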

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGSchedulerEventProcessLoop is requested to handle a JobSubmitted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleJobSubmitted-BarrierJobSlotsNumberCheckFailed","title":"BarrierJobSlotsNumberCheckFailed","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In case of a BarrierJobSlotsNumberCheckFailed exception while creating a ResultStage, handleJobSubmitted increments the number of failures in the barrierJobIdToNumTasksCheckFailures for the given jobId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleJobSubmitted prints out the following WARN message to the logs (with spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Barrier stage in job [jobId] requires [requiredConcurrentTasks] slots, but only [maxConcurrentTasks] are available. Will retry up to [maxFailures] more times\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the number of failures is below the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted requests the messageScheduler to schedule a one-shot task that requests the DAGSchedulerEventProcessLoop to post a JobSubmitted event (after spark.scheduler.barrier.maxConcurrentTasksCheck.interval seconds).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

Posting a JobSubmitted event requests the DAGScheduler to reconsider the submission, in the hope that there will be enough resources to fulfill the resource requirements of the barrier job.

Otherwise, if the number of failures crossed the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted removes the jobId from the barrierJobIdToNumTasksCheckFailures and informs the given JobListener that the job failed (jobFailed).
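
The retry bookkeeping can be sketched as follows; the failure counter, the scheduler, and the callbacks are simplified stand-ins, and only the per-job counting, the delayed re-submission, and the give-up path reflect the description above.

import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

object BarrierRetrySketch {
  private val failures = mutable.Map.empty[Int, Int].withDefaultValue(0)
  private val messageScheduler = Executors.newSingleThreadScheduledExecutor()

  def onBarrierCheckFailed(
      jobId: Int,
      maxFailures: Int,
      retryIntervalSeconds: Long,
      resubmit: Int => Unit,
      jobFailed: Throwable => Unit): Unit = {
    failures(jobId) += 1
    if (failures(jobId) < maxFailures) {
      // Schedule a one-shot re-submission, hoping more slots become available.
      messageScheduler.schedule(
        new Runnable { def run(): Unit = resubmit(jobId) },
        retryIntervalSeconds,
        TimeUnit.SECONDS)
    } else {
      failures -= jobId
      jobFailed(new IllegalStateException(
        s"Barrier stage in job $jobId cannot get enough concurrent task slots"))
    }
  }
}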

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleMapStageSubmitted","title":"MapStageSubmitted","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted(\n  jobId: Int,\n  dependency: ShuffleDependency[_, _, _],\n  callSite: CallSite,\n  listener: JobListener,\n  properties: Properties): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MapStageSubmitted event processing is very similar to JobSubmitted event's.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted finds or creates a new ShuffleMapStage for the given ShuffleDependency and jobId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted creates an ActiveJob (with the given jobId, the ShuffleMapStage, the given JobListener).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted clears the internal cache of RDD partition locations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted prints out the following INFO messages to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Got map stage job [id] ([callSite]) with [number] output partitions\nFinal stage: [stage] ([name])\nParents of final stage: [parents]\nMissing parents: [missingParentStages]\n

handleMapStageSubmitted registers the new ActiveJob in the jobIdToActiveJob and activeJobs internal registries, and with the ShuffleMapStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapStage can have multiple ActiveJobs registered.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted finds all the registered stages for the input jobId and collects their latest StageInfo.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, handleMapStageSubmitted posts a SparkListenerJobStart event to the LiveListenerBus and submits the ShuffleMapStage.

When the ShuffleMapStage is already available, handleMapStageSubmitted marks the job as finished.

When handleMapStageSubmitted could not find or create a ShuffleMapStage, it prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Creating new stage failed due to exception - job: [id]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted notifies the JobListener about the job failure and exits.
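
Putting the steps together, the flow might look like the following simplified, self-contained sketch (stand-in types and helper methods, not Spark's real ones).

import scala.collection.mutable
import scala.util.{Failure, Success, Try}

final case class ShuffleMapStage(id: Int, var isAvailable: Boolean = false) {
  val jobIds: mutable.Set[Int] = mutable.Set.empty   // a ShuffleMapStage can serve many jobs
}
final case class ActiveJob(jobId: Int, stage: ShuffleMapStage)

class MapStageSubmittedSketch {
  private val jobIdToActiveJob = mutable.Map.empty[Int, ActiveJob]
  private val activeJobs = mutable.Set.empty[ActiveJob]

  private def getOrCreateShuffleMapStage(jobId: Int): ShuffleMapStage = ShuffleMapStage(jobId * 10)
  private def clearCacheLocs(): Unit = ()
  private def postJobStart(job: ActiveJob): Unit = ()
  private def submitStage(stage: ShuffleMapStage): Unit = ()
  private def markMapStageJobAsFinished(job: ActiveJob): Unit = ()
  private def notifyJobFailed(jobId: Int, e: Throwable): Unit = ()

  def handleMapStageSubmitted(jobId: Int): Unit =
    Try(getOrCreateShuffleMapStage(jobId)) match {
      case Failure(e) =>
        notifyJobFailed(jobId, e)       // "Creating new stage failed..." path: tell the listener and exit
      case Success(stage) =>
        val job = ActiveJob(jobId, stage)
        clearCacheLocs()
        jobIdToActiveJob(jobId) = job
        activeJobs += job
        stage.jobIds += jobId           // register the job with the stage
        postJobStart(job)
        submitStage(stage)
        if (stage.isAvailable) markMapStageJobAsFinished(job)
    }
}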

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleMapStageSubmitted is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGSchedulerEventProcessLoop is requested to handle a MapStageSubmitted event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#resubmitFailedStages","title":"ResubmitFailedStages Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages iterates over the internal collection of failed stages and submits them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages does nothing when there are no failed stages reported.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Resubmitting failed stages\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages clears the internal cache of RDD partition locations and makes a copy of the collection of failed stages to track failed stages afresh.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            At this point DAGScheduler has no failed stages reported.

The previously-reported failed stages are sorted by their job IDs in ascending order and resubmitted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resubmitFailedStages is used when DAGSchedulerEventProcessLoop is requested to handle a ResubmitFailedStages event.
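
A minimal Scala sketch of the resubmission pattern described above, using a simplified Stage stand-in (the real DAGScheduler works with its internal Stage type and submitStage, and also clears the cached RDD partition locations):

import scala.collection.mutable

// Simplified stand-in for the real Stage type (hypothetical, not Spark's Stage).
final case class Stage(id: Int, firstJobId: Int)

object ResubmitSketch {

  // Registry of stages that failed due to fetch failures (simplified).
  private val failedStages = mutable.HashSet.empty[Stage]

  private def submitStage(stage: Stage): Unit =
    println(s"Submitting stage ${stage.id} (job ${stage.firstJobId})")

  def resubmitFailedStages(): Unit =
    if (failedStages.nonEmpty) {
      println("Resubmitting failed stages")
      // Work on a copy so that stages failing from now on are tracked afresh.
      // (The real code also clears the cache of RDD partition locations here.)
      val toResubmit = failedStages.toArray
      failedStages.clear()
      // Resubmit in ascending order of the owning job ID.
      toResubmit.sortBy(_.firstJobId).foreach(submitStage)
    }

  def main(args: Array[String]): Unit = {
    failedStages ++= Seq(Stage(id = 3, firstJobId = 2), Stage(id = 1, firstJobId = 0))
    resubmitFailedStages()
  }
}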

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleSpeculativeTaskSubmitted","title":"SpeculativeTaskSubmitted Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleSpeculativeTaskSubmitted(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleSpeculativeTaskSubmitted...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleSpeculativeTaskSubmitted is used when DAGSchedulerEventProcessLoop is requested to handle a SpeculativeTaskSubmitted event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleStageCancellation","title":"StageCancelled Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleStageCancellation(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleStageCancellation...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleStageCancellation is used when DAGSchedulerEventProcessLoop is requested to handle a StageCancelled event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleTaskSetFailed","title":"TaskSetFailed Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskSetFailed(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskSetFailed...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleTaskSetFailed is used when DAGSchedulerEventProcessLoop is requested to handle a TaskSetFailed event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#handleWorkerRemoved","title":"WorkerRemoved Event Handler","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleWorkerRemoved(\n  workerId: String,\n  host: String,\n  message: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleWorkerRemoved...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            handleWorkerRemoved is used when DAGSchedulerEventProcessLoop is requested to handle a WorkerRemoved event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#internal-properties","title":"Internal Properties","text":""},{"location":"scheduler/DAGScheduler/#failedEpoch","title":"failedEpoch","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The lookup table of lost executors and the epoch of the event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#failedStages","title":"failedStages","text":"

Stages that failed due to fetch failures (when a task fails with a FetchFailed exception).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#jobIdToActiveJob","title":"jobIdToActiveJob","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The lookup table of ActiveJobs per job id.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#jobIdToStageIds","title":"jobIdToStageIds","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The lookup table of all stages per ActiveJob id

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#nextJobId","title":"nextJobId Counter","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            nextJobId: AtomicInteger\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            nextJobId is a Java AtomicInteger for job IDs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            nextJobId starts at 0.

Used when DAGScheduler is requested for numTotalJobs and when submitJob, runApproximateJob, and submitMapStage are executed.
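
A short sketch, assuming nothing beyond java.util.concurrent.atomic, of how such an AtomicInteger-based counter can both hand out unique job IDs and serve as a total-jobs counter:

import java.util.concurrent.atomic.AtomicInteger

object JobIdCounterSketch {
  // Counter starts at 0; every submitted job takes the next value.
  private val nextJobId = new AtomicInteger(0)

  def newJobId(): Int = nextJobId.getAndIncrement()

  // The current value doubles as the total number of jobs handed out so far.
  def numTotalJobs: Int = nextJobId.get()

  def main(args: Array[String]): Unit = {
    val ids = (1 to 3).map(_ => newJobId())
    println(s"ids=$ids totalJobs=$numTotalJobs") // ids=Vector(0, 1, 2) totalJobs=3
  }
}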

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#nextStageId","title":"nextStageId","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The next stage id counting from 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when DAGScheduler creates a shuffle map stage and a result stage. It is the key in stageIdToStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#runningStages","title":"runningStages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The set of stages that are currently \"running\".

A stage is added when submitMissingTasks gets executed (without first checking whether the stage is already registered).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#shuffleIdToMapStage","title":"shuffleIdToMapStage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A lookup table of ShuffleMapStages by ShuffleDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#stageIdToStage","title":"stageIdToStage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A lookup table of stages by stage ID

Used when DAGScheduler creates a shuffle map stage or a result stage, cleans up job state and independent stages, is informed that a task has started or a taskset has failed, handles a submitted job (to compute a ResultStage) or a submitted map stage, handles a task completion or a stage cancellation, updates accumulators, and aborts a stage and fails a job and independent stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#waitingStages","title":"waitingStages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Stages with parents to be computed
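
A minimal sketch, with a hypothetical simplified Stage placeholder, of how the registries described in this section could be modeled as mutable maps and sets (the real DAGScheduler fields differ in their exact types and visibility):

import scala.collection.mutable

// Simplified placeholder for the real Stage class hierarchy (hypothetical).
final case class Stage(id: Int)

class StageRegistries {
  // stage ID -> stage (cf. stageIdToStage)
  val stageIdToStage = mutable.HashMap.empty[Int, Stage]
  // job ID -> IDs of all stages registered for that job (cf. jobIdToStageIds)
  val jobIdToStageIds = mutable.HashMap.empty[Int, mutable.HashSet[Int]]
  // Stages grouped by scheduling state (cf. waitingStages, runningStages, failedStages)
  val waitingStages = mutable.HashSet.empty[Stage]
  val runningStages = mutable.HashSet.empty[Stage]
  val failedStages  = mutable.HashSet.empty[Stage]

  // Registering a new stage touches several registries at once.
  def register(jobId: Int, stage: Stage): Unit = {
    stageIdToStage(stage.id) = stage
    jobIdToStageIds.getOrElseUpdate(jobId, mutable.HashSet.empty[Int]) += stage.id
    waitingStages += stage
  }
}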

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#event-posting-methods","title":"Event Posting Methods","text":""},{"location":"scheduler/DAGScheduler/#cancelAllJobs","title":"Posting AllJobsCancelled","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts an AllJobsCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when SparkContext is requested to cancel all running or scheduled Spark jobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#cancelJob","title":"Posting JobCancelled","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a JobCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when SparkContext or JobWaiter are requested to cancel a Spark job

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#cancelJobGroup","title":"Posting JobGroupCancelled","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a JobGroupCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when SparkContext is requested to cancel a job group

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#cancelStage","title":"Posting StageCancelled","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a StageCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when SparkContext is requested to cancel a stage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#executorAdded","title":"Posting ExecutorAdded","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts an ExecutorAdded

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSchedulerImpl is requested to handle resource offers (and a new executor is found in the resource offers)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#executorLost","title":"Posting ExecutorLost","text":"

Posts an ExecutorLost

Used when TaskSchedulerImpl is requested to handle a task status update (and a task is reported lost, which indicates that its executor is broken and should be considered lost) or to handle executorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#runApproximateJob","title":"Posting JobSubmitted","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a JobSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when SparkContext is requested to run an approximate job

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#speculativeTaskSubmitted","title":"Posting SpeculativeTaskSubmitted","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a SpeculativeTaskSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSetManager is requested to checkAndSubmitSpeculatableTask

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#taskEnded","title":"Posting CompletionEvent","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a CompletionEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSetManager is requested to handleSuccessfulTask, handleFailedTask, and executorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#taskGettingResult","title":"Posting GettingResultEvent","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a GettingResultEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSetManager is requested to handle a task fetching result

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#taskSetFailed","title":"Posting TaskSetFailed","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a TaskSetFailed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSetManager is requested to abort

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#taskStarted","title":"Posting BeginEvent","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a BeginEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSetManager is requested to start a task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#workerRemoved","title":"Posting WorkerRemoved","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posts a WorkerRemoved

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when TaskSchedulerImpl is requested to handle a removed worker event
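
All of the posting methods above follow the same pattern: wrap the arguments in an event and post it to the event process loop, which handles it asynchronously on the event thread. A hedged sketch of that pattern with toy stand-ins (not the actual DAGSchedulerEventProcessLoop API):

import java.util.concurrent.LinkedBlockingQueue

// Simplified stand-ins for the real DAGSchedulerEvent types (hypothetical).
sealed trait SchedulerEvent
case object AllJobsCancelled extends SchedulerEvent
final case class JobCancelled(jobId: Int, reason: Option[String]) extends SchedulerEvent

// A toy event loop: post only enqueues; a dedicated event thread would drain the queue.
class EventLoopSketch {
  private val queue = new LinkedBlockingQueue[SchedulerEvent]()
  def post(event: SchedulerEvent): Unit = queue.put(event)
  def take(): SchedulerEvent = queue.take()
}

class SchedulerFacadeSketch(eventProcessLoop: EventLoopSketch) {
  // Each "posting" method only wraps its arguments in an event and posts it;
  // the corresponding handler runs later, asynchronously, on the event thread.
  def cancelAllJobs(): Unit = eventProcessLoop.post(AllJobsCancelled)
  def cancelJob(jobId: Int, reason: Option[String]): Unit =
    eventProcessLoop.post(JobCancelled(jobId, reason))
}

Per the entries above, the real posting methods are similarly thin; the interesting work happens in the corresponding event handlers.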

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#updateAccumulators","title":"Updating Accumulators of Completed Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateAccumulators(\n  event: CompletionEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateAccumulators merges the partial values of accumulators from a completed task (based on the given CompletionEvent) into their \"source\" accumulators on the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For every AccumulatorV2 update (in the given CompletionEvent), updateAccumulators finds the corresponding accumulator on the driver and requests the AccumulatorV2 to merge the updates.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateAccumulators...FIXME

For named accumulators whose update value is non-zero (i.e. not Accumulable.zero):

• the stage.latestInfo.accumulables entry for the AccumulableInfo.id is set
• a new AccumulableInfo is added to CompletionEvent.taskInfo.accumulables

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME Where are Stage.latestInfo.accumulables and CompletionEvent.taskInfo.accumulables used?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            updateAccumulators is used when DAGScheduler is requested to handle a task completion.
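The merge itself relies on the public AccumulatorV2 API. The following standalone Scala sketch (illustrative names only, not the DAGScheduler code) shows how a task-side partial update is folded into its driver-side source accumulator:

import org.apache.spark.util.LongAccumulator

// Standalone sketch: fold a task-side partial update into the driver-side accumulator.
object AccumulatorMergeSketch {
  def main(args: Array[String]): Unit = {
    val driverAcc = new LongAccumulator        // the "source" accumulator on the driver
    driverAcc.add(10L)

    val taskUpdate = driverAcc.copyAndReset()  // the kind of partial value a task reports back
    taskUpdate.add(32L)

    driverAcc.merge(taskUpdate)                // what updateAccumulators conceptually does per update
    println(driverAcc.value)                   // 42
  }
}

In the real code path the partial values arrive as AccumulatorV2 updates in the CompletionEvent, and each one is matched to its driver-side counterpart before merge is requested, as described above.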

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#postTaskEnd","title":"Posting SparkListenerTaskEnd (at Task Completion)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            postTaskEnd(\n  event: CompletionEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            postTaskEnd reconstructs task metrics (from the accumulator updates in the CompletionEvent).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, postTaskEnd creates a SparkListenerTaskEnd and requests the LiveListenerBus to post it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            postTaskEnd is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handle a task completion
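A simple way to observe the SparkListenerTaskEnd events produced here is a custom SparkListener. The sketch below uses only the public listener API; the class name TaskEndLogger and the printed metric are illustrative:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Observes the SparkListenerTaskEnd events that postTaskEnd sends to the LiveListenerBus.
class TaskEndLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val runTime = Option(taskEnd.taskMetrics).map(_.executorRunTime).getOrElse(-1L)
    println(s"task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} " +
      s"ended (${taskEnd.reason}) after $runTime ms")
  }
}

// Register on an existing SparkContext (sc):
//   sc.addSparkListener(new TaskEndLogger)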
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#checkBarrierStageWithNumSlots","title":"checkBarrierStageWithNumSlots
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            checkBarrierStageWithNumSlots(\n  rdd: RDD[_],\n  rp: ResourceProfile): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Noop for Non-Barrier RDDs

Unless the given RDD is a barrier RDD (isBarrier), checkBarrierStageWithNumSlots does nothing (i.e. it is a noop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            checkBarrierStageWithNumSlots requests the given RDD for the number of partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            checkBarrierStageWithNumSlots requests the SparkContext for the maximum number of concurrent tasks for the given ResourceProfile.

If the number of partitions (based on the RDD) is greater than the maximum number of concurrent tasks (based on the ResourceProfile), checkBarrierStageWithNumSlots throws a BarrierJobSlotsNumberCheckFailed exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            checkBarrierStageWithNumSlots is used when:

• DAGScheduler is requested to create a ShuffleMapStage or a ResultStage
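The check boils down to comparing two numbers. A simplified standalone sketch follows (checkBarrierSlots and BarrierCheckFailed are hypothetical stand-ins for the private Spark code; only the shape of the check is the point):

// Hypothetical stand-ins for the private Spark code.
final case class BarrierCheckFailed(numPartitions: Int, maxSlots: Int)
  extends Exception(s"barrier stage requires $numPartitions slots but only $maxSlots are available")

def checkBarrierSlots(isBarrier: Boolean, numPartitions: Int, maxConcurrentTasks: Int): Unit = {
  if (isBarrier && numPartitions > maxConcurrentTasks) {
    // a barrier stage must be able to launch all of its tasks at the same time
    throw BarrierCheckFailed(numPartitions, maxConcurrentTasks)
  }
}

For example, checkBarrierSlots(isBarrier = true, numPartitions = 8, maxConcurrentTasks = 4) fails, while the same call with isBarrier = false is a noop.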
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#utilities","title":"Utilities

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Danger

This section collects utility methods that do not really contribute to the understanding of how DAGScheduler works internally.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            It's very likely they should not even be part of this page.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGScheduler/#getShuffleDependenciesAndResourceProfiles","title":"Finding Shuffle Dependencies and ResourceProfiles of RDD","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getShuffleDependenciesAndResourceProfiles(\n  rdd: RDD[_]): (HashSet[ShuffleDependency[_, _, _]], HashSet[ResourceProfile])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getShuffleDependenciesAndResourceProfiles returns the direct ShuffleDependencies and all the ResourceProfiles of the given RDD and parent non-shuffle RDDs, if available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getShuffleDependenciesAndResourceProfiles collects ResourceProfiles of the given RDD and any parent RDDs, if available.

getShuffleDependenciesAndResourceProfiles collects the direct ShuffleDependencies of the given RDD and of any parent RDDs reachable through non-shuffle dependencies, if available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getShuffleDependenciesAndResourceProfiles is used when:

• DAGScheduler is requested to create a ShuffleMapStage or a ResultStage, and to find the missing ShuffleDependencies of an RDD
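The traversal itself is a plain depth-first walk over the RDD lineage that stops at shuffle boundaries. A simplified sketch, using only the public RDD and Dependency API (the real, private method also collects ResourceProfiles along the way):

import scala.collection.mutable
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Simplified: collects only the direct shuffle dependencies of the given RDD.
def directShuffleDependencies(rdd: RDD[_]): mutable.HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new mutable.HashSet[ShuffleDependency[_, _, _]]
  val visited = new mutable.HashSet[RDD[_]]
  val waitingForVisit = mutable.Stack[RDD[_]](rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (visited.add(toVisit)) {
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep               // stop at a shuffle boundary
        case narrowDep =>
          waitingForVisit.push(narrowDep.rdd) // keep walking narrow dependencies
      }
    }
  }
  parents
}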
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGScheduler/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.scheduler.DAGScheduler logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            logger.DAGScheduler.name = org.apache.spark.scheduler.DAGScheduler\nlogger.DAGScheduler.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGSchedulerEvent/","title":"DAGSchedulerEvent","text":"

DAGSchedulerEvent is an abstraction of events that are handled by the DAGScheduler (on the dag-scheduler-event-loop daemon thread).
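The pattern is a sealed hierarchy of event messages consumed by a single daemon thread. Spark's EventLoop and the concrete event classes are private, so the sketch below is only a stand-in that mirrors the idea (all names are illustrative):

import java.util.concurrent.LinkedBlockingDeque

sealed trait SchedulerEvent                          // stand-in for DAGSchedulerEvent
case object AllJobsCancelled extends SchedulerEvent
final case class JobCancelled(jobId: Int, reason: Option[String]) extends SchedulerEvent

// Events are posted from many threads but handled one by one on a single daemon thread.
final class SimpleEventLoop(name: String)(handle: SchedulerEvent => Unit) {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()
  private val thread = new Thread(name) {
    override def run(): Unit =
      try { while (true) handle(queue.take()) }
      catch { case _: InterruptedException => () }   // stop() interrupts the loop
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def post(event: SchedulerEvent): Unit = queue.put(event)
  def stop(): Unit = thread.interrupt()
}

Posting JobCancelled(1, Some("user request")) to such a loop is, conceptually, what happens when one of the events below is triggered.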

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/DAGSchedulerEvent/#alljobscancelled","title":"AllJobsCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Carries no extra information

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when DAGScheduler is requested to cancelAllJobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Event handler: doCancelAllJobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGSchedulerEvent/#beginevent","title":"BeginEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskInfo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when DAGScheduler is requested to taskStarted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Event handler: handleBeginEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/DAGSchedulerEvent/#completionevent","title":"CompletionEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Completed Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskEndReason
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Result (value computed)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AccumulatorV2 Updates
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Metric Peaks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskInfo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to taskEnded

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleTaskCompletion

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#executoradded","title":"ExecutorAdded

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Host name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to executorAdded

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleExecutorAdded

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#executorlost","title":"ExecutorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reason

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to executorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleExecutorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#gettingresultevent","title":"GettingResultEvent

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskInfo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to taskGettingResult

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleGetTaskResult

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobcancelled","title":"JobCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              JobCancelled event carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reason (optional)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to cancelJob

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleJobCancellation
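From user code, this event is triggered indirectly through the public SparkContext API, e.g. (assuming sc is an active SparkContext and 42 is an existing job ID):

// Both overloads exist on SparkContext; either one ends up as a JobCancelled event.
sc.cancelJob(42)
sc.cancelJob(42, "cancelled by the operator")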

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobgroupcancelled","title":"JobGroupCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Group ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to cancelJobGroup

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleJobGroupCancelled
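
A common user-facing pattern that leads to this event is grouping jobs with SparkContext.setJobGroup and then cancelling the whole group. A sketch (the group ID and description are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: jobs started on this thread after setJobGroup share the group ID;
// cancelJobGroup asks the DAGScheduler to cancel them all, which posts a
// JobGroupCancelled event.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("job-group-sketch").setMaster("local[*]"))
sc.setJobGroup("etl-group", "nightly ETL jobs", interruptOnCancel = true)
// ... trigger some actions on this thread ...
sc.cancelJobGroup("etl-group")
```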

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#jobsubmitted","title":"JobSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Partition processing function (with a TaskContext and the partition data, i.e. (TaskContext, Iterator[_]) => _)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Partition IDs to compute
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • CallSite
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • JobListener to keep updated about the status of the stage execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Execution properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to submit a job, run an approximate job and handleJobSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleJobSubmitted
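
The payload mirrors what SparkContext.runJob hands to the DAGScheduler: an RDD, a per-partition function over (TaskContext, Iterator[_]), and the partitions to compute. A sketch of such a call (the application name and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Sketch: runJob passes an RDD, a function of type
// (TaskContext, Iterator[T]) => U, and the partition IDs to compute;
// these become the payload of a JobSubmitted event.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("job-submitted-sketch").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val results: Array[Int] = sc.runJob(
  rdd,
  (ctx: TaskContext, it: Iterator[Int]) => it.sum,
  Seq(0, 1) // only the first two partitions
)
```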

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#mapstagesubmitted","title":"MapStageSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • CallSite
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • JobListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Execution properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to submit a MapStage for execution

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleMapStageSubmitted
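
To make the payload concrete, the carrier can be pictured roughly as the following case class. This is a self-contained sketch reconstructed from the list above, not the actual Spark source; the real ShuffleDependency, CallSite and JobListener types are stubbed out here.

```scala
import java.util.Properties

// Rough sketch of the information a MapStageSubmitted event carries.
// The trait names below are placeholders for the real Spark types.
trait ShuffleDependencyStub
trait CallSiteStub
trait JobListenerStub

case class MapStageSubmittedSketch(
    jobId: Int,
    dependency: ShuffleDependencyStub,
    callSite: CallSiteStub,
    listener: JobListenerStub,
    properties: Properties = null)
```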

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#resubmitfailedstages","title":"ResubmitFailedStages

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries no extra information.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to handleTaskCompletion

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: resubmitFailedStages

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#shufflemergefinalized","title":"ShuffleMergeFinalized

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ShuffleMapStage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to finalizeShuffleMerge

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleShuffleMergeFinalized

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#speculativetasksubmitted","title":"SpeculativeTaskSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to speculativeTaskSubmitted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleSpeculativeTaskSubmitted
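
Speculative tasks only arise when speculative execution is enabled. A configuration sketch (property names as in the standard Spark configuration; the values shown are the usual defaults and are illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: with speculation enabled, slow tasks may be resubmitted
// speculatively, which surfaces in the DAGScheduler as
// SpeculativeTaskSubmitted events.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1.5") // how much slower than the median a task must be
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish first
```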

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#stagecancelled","title":"StageCancelled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reason (optional)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to cancelStage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleStageCancellation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#tasksetfailed","title":"TaskSetFailed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reason
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Exception (optional)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to taskSetFailed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleTaskSetFailed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEvent/#workerremoved","title":"WorkerRemoved

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Worked ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Host name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reason

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Posted when DAGScheduler is requested to workerRemoved

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Event handler: handleWorkerRemoved

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/","title":"DAGSchedulerEventProcessLoop","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              DAGSchedulerEventProcessLoop is an event processing daemon thread to handle DAGSchedulerEvents (on a separate thread from the parent DAGScheduler's).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              DAGSchedulerEventProcessLoop is registered under the name of dag-scheduler-event-loop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              DAGSchedulerEventProcessLoop uses java.util.concurrent.LinkedBlockingDeque blocking deque that can grow indefinitely.
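
To make the threading model concrete, here is a minimal, generic sketch of the pattern: events are posted to an unbounded LinkedBlockingDeque and drained one at a time by a dedicated daemon thread. This is illustrative only; it is not Spark's EventLoop class, and the event types are invented for the example.

```scala
import java.util.concurrent.LinkedBlockingDeque

// Generic event-loop sketch: post events to an unbounded blocking deque and
// handle them sequentially on a daemon thread.
sealed trait SchedulerEvent
case class Submitted(jobId: Int) extends SchedulerEvent
case object Shutdown extends SchedulerEvent

class EventLoopSketch(name: String) {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()

  private val thread = new Thread(name) {
    override def run(): Unit = {
      var running = true
      while (running) {
        queue.take() match {
          case Shutdown         => running = false
          case Submitted(jobId) => println(s"handling job $jobId")
        }
      }
    }
  }
  thread.setDaemon(true) // do not keep the JVM alive just for the loop

  def start(): Unit = thread.start()
  def post(event: SchedulerEvent): Unit = queue.put(event)
  def stop(): Unit = queue.put(Shutdown)
}
```

Usage would be along the lines of: create the loop, start it, and post events from other threads, e.g. `new EventLoopSketch("dag-scheduler-event-loop")`, then `start()` and `post(Submitted(1))`.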

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/DAGSchedulerEventProcessLoop/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              DAGSchedulerEventProcessLoop takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DAGSchedulerEventProcessLoop is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DAGSchedulerEventProcessLoop/#processing-event","title":"Processing Event DAGSchedulerEvent Event Handler AllJobsCancelled doCancelAllJobs BeginEvent handleBeginEvent CompletionEvent handleTaskCompletion ExecutorAdded handleExecutorAdded ExecutorLost handleExecutorLost GettingResultEvent handleGetTaskResult JobCancelled handleJobCancellation JobGroupCancelled handleJobGroupCancelled JobSubmitted handleJobSubmitted MapStageSubmitted handleMapStageSubmitted ResubmitFailedStages resubmitFailedStages SpeculativeTaskSubmitted handleSpeculativeTaskSubmitted StageCancelled handleStageCancellation TaskSetFailed handleTaskSetFailed WorkerRemoved handleWorkerRemoved","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/#shufflemergefinalized","title":"ShuffleMergeFinalized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Event: ShuffleMergeFinalized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Event handler: handleShuffleMergeFinalized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGSchedulerEventProcessLoop/#messageprocessingtime-timer","title":"messageProcessingTime Timer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DAGSchedulerEventProcessLoop uses messageProcessingTime timer to measure time of processing events.
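
Spark's metrics system is built on Dropwizard Metrics, so timing a unit of work with such a timer looks roughly like the following sketch; the registry and metric name here are illustrative, not taken from the Spark source.

```scala
import com.codahale.metrics.{MetricRegistry, Timer}

// Sketch: measure how long handling one event takes using a Dropwizard timer.
val registry = new MetricRegistry()
val messageProcessingTime: Timer = registry.timer("messageProcessingTime")

val context = messageProcessingTime.time()
try {
  // handle a DAGSchedulerEvent here
} finally {
  context.stop() // records the elapsed time in the timer's histogram
}
```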

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DAGSchedulerSource/","title":"DAGSchedulerSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DAGSchedulerSource is the metrics source of DAGScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The name of the source is DAGScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DAGSchedulerSource emits the following metrics:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • stage.failedStages - the number of failed stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • stage.runningStages - the number of running stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • stage.waitingStages - the number of waiting stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • job.allJobs - the number of all jobs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • job.activeJobs - the number of active jobs
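
These are gauge-style metrics: each one reads a current value on demand. A hedged sketch of how such gauges can be registered with Dropwizard Metrics (the library Spark's metrics system uses); the counter functions are placeholders standing in for DAGScheduler's internal registries.

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Sketch: register gauges named like the metrics listed above. The count
// functions are placeholders, not real DAGScheduler accessors.
val registry = new MetricRegistry()

def activeJobCount(): Int = 0    // placeholder for the active-jobs registry
def runningStageCount(): Int = 0 // placeholder for the running-stages registry

registry.register(MetricRegistry.name("job", "activeJobs"), new Gauge[Int] {
  override def getValue: Int = activeJobCount()
})
registry.register(MetricRegistry.name("stage", "runningStages"), new Gauge[Int] {
  override def getValue: Int = runningStageCount()
})
```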
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/","title":"DriverEndpoint","text":"

DriverEndpoint is a ThreadSafeRpcEndpoint that serves as the message handler of CoarseGrainedSchedulerBackend for communication with CoarseGrainedExecutorBackend.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint is registered under the name CoarseGrainedScheduler by CoarseGrainedSchedulerBackend.

DriverEndpoint uses the executorDataMap internal registry to track all the executors that have registered with the driver. An executor sends a RegisterExecutor message to request registration.
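
For intuition, the registry can be pictured as a mutable map keyed by executor ID. The sketch below is illustrative only; the fields shown on ExecutorData are a simplification, not the full Spark type:

import scala.collection.mutable

// Simplified view of the registry: executor ID -> ExecutorData
// (illustrative fields only; the real ExecutorData carries more state)
case class ExecutorData(executorHost: String, totalCores: Int, var freeCores: Int)

val executorDataMap = mutable.HashMap.empty[String, ExecutorData]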

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedSchedulerBackend is created (and registers a CoarseGrainedScheduler RPC endpoint)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/#executorlogurlhandler","title":"ExecutorLogUrlHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                logUrlHandler: ExecutorLogUrlHandler\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint creates an ExecutorLogUrlHandler (based on spark.ui.custom.executor.log.url configuration property) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint uses the ExecutorLogUrlHandler to create an ExecutorData when requested to handle a RegisterExecutor message.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#onStart","title":"Starting DriverEndpoint RpcEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onStart(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onStart is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                onStart requests the Revive Messages Scheduler Service to schedule a periodic action that sends ReviveOffers messages every revive interval (based on spark.scheduler.revive.interval configuration property).
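
Conceptually, this boils down to a fixed-rate timer that keeps sending ReviveOffers to the endpoint itself. A minimal sketch follows (not Spark's actual code; the SelfRef trait and startReviveTimer helper are stand-ins for the endpoint's self RPC reference and the revive-messages scheduler service):

import java.util.concurrent.{Executors, TimeUnit}

// Stand-ins for the endpoint's self RPC reference and the ReviveOffers message
case object ReviveOffers
trait SelfRef { def send(message: Any): Unit }

// Schedule ReviveOffers every reviveIntervalMs milliseconds
// (the interval comes from spark.scheduler.revive.interval).
def startReviveTimer(self: SelfRef, reviveIntervalMs: Long) = {
  val reviveThread = Executors.newSingleThreadScheduledExecutor()
  reviveThread.scheduleAtFixedRate(
    new Runnable { override def run(): Unit = self.send(ReviveOffers) },
    0, reviveIntervalMs, TimeUnit.MILLISECONDS)
  reviveThread
}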

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#makeOffers","title":"Launching Tasks

There are two makeOffers methods to launch tasks that differ by the number of active executors (from the executorDataMap registry) they work with:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • All Active Executors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Single Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#on-all-active-executors","title":"On All Active Executors","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                makeOffers(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                makeOffers builds WorkerOffers for every active executor (in the executorDataMap registry) and requests the TaskSchedulerImpl to generate tasks for the available worker offers (that creates TaskDescriptions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With tasks (TaskDescriptions) to be launched, makeOffers launches them.
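
The flow can be summarized in a short sketch. The types and the resourceOffers/launchTasks parameters below are simplified stand-ins for TaskSchedulerImpl and the task-launching machinery, not Spark's actual API:

// Simplified stand-ins for the real Spark types
case class ExecutorData(executorHost: String, var freeCores: Int)
case class WorkerOffer(executorId: String, host: String, cores: Int)
case class TaskDescription(taskId: Long, executorId: String)

// Build one WorkerOffer per active executor and ask the scheduler for tasks;
// launch whatever TaskDescriptions come back.
def makeOffers(
    executorDataMap: Map[String, ExecutorData],
    resourceOffers: Seq[WorkerOffer] => Seq[Seq[TaskDescription]],
    launchTasks: Seq[Seq[TaskDescription]] => Unit): Unit = {
  val workOffers = executorDataMap.toSeq.map { case (id, data) =>
    WorkerOffer(id, data.executorHost, data.freeCores)
  }
  val tasks = resourceOffers(workOffers)
  if (tasks.nonEmpty) {
    launchTasks(tasks)
  }
}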

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                makeOffers is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DriverEndpoint handles ReviveOffers messages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/#on-single-executor","title":"On Single Executor","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                makeOffers(\n  executorId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

makeOffers for a single executor follows the same steps as makeOffers for all active executors, but builds a worker offer for just the given executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                makeOffers is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DriverEndpoint handles StatusUpdate and LaunchedExecutor messages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/#launchTasks","title":"Launching Tasks","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                launchTasks(\n  tasks: Seq[Seq[TaskDescription]]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The input tasks collection contains one or more TaskDescriptions per executor (and the \"task partitioning\" per executor is of no use in launchTasks so it simply flattens the input data structure).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                For every TaskDescription (in the given tasks collection), launchTasks encodes it and makes sure that the encoded task size is below the allowed message size.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                launchTasks looks up the ExecutorData of the executor that has been assigned to execute the task (in executorDataMap internal registry) and decreases the executor's free cores (based on spark.task.cpus configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

Task scheduling in Spark relies on cores only (not memory), i.e. the number of tasks Spark can run concurrently on an executor is limited solely by the number of available cores. Both executor resources -- memory and cores -- can be specified explicitly when submitting a Spark application, but it is the job of a cluster manager to monitor memory use and take action when it exceeds what was assigned.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                launchTasks prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Launching task [taskId] on executor id: [executorId] hostname: [executorHost].\n

In the end, launchTasks sends the serialized task to the executor by sending a LaunchTask message to the executor's RPC endpoint (with the serialized task wrapped in a SerializableBuffer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                This is the moment in a task's lifecycle when the driver sends the serialized task to an assigned executor.
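
Putting the steps together, a simplified sketch of the happy path might look as follows. The helper parameters (encode, the simplified ExecutorData, cpusPerTask) are hypothetical stand-ins for the serializer, registries and RPC plumbing:

import java.nio.ByteBuffer

// Simplified stand-ins for the real Spark types
case class TaskDescription(taskId: Long, executorId: String)
case class LaunchTask(serializedTask: ByteBuffer)
case class ExecutorData(var freeCores: Int, send: Any => Unit)

// Serialize every task, make sure it fits in an RPC message, charge the
// executor spark.task.cpus cores, and ship the task as a LaunchTask message.
def launchTasks(
    tasks: Seq[Seq[TaskDescription]],
    encode: TaskDescription => ByteBuffer,
    executorDataMap: collection.mutable.Map[String, ExecutorData],
    maxRpcMessageSize: Int,
    cpusPerTask: Int): Unit = {
  for (task <- tasks.flatten) {
    val serializedTask = encode(task)
    if (serializedTask.limit() >= maxRpcMessageSize) {
      // abort the task set (see "Task Exceeds Allowed Size" below)
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= cpusPerTask
      executorData.send(LaunchTask(serializedTask))
    }
  }
}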

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/DriverEndpoint/#task-exceeds-allowed-size","title":"Task Exceeds Allowed Size

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In case the size of a serialized TaskDescription equals or exceeds the maximum allowed RPC message size, launchTasks looks up the TaskSetManager for the TaskDescription (in taskIdToTaskSetManager registry) and aborts it with the following message:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Serialized task [id]:[index] was [limit] bytes, which exceeds max allowed: spark.rpc.message.maxSize ([maxRpcMessageSize] bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.\n
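
When the limit is hit because tasks genuinely carry large serialized data, the remedies are exactly the ones in the message: move large values into broadcast variables, or raise the limit. For example (the value is in MiB; 128 is the default):

import org.apache.spark.SparkConf

// Raise the maximum RPC message size to 256 MiB
// (prefer broadcast variables for large values where possible).
val conf = new SparkConf()
  .set("spark.rpc.message.maxSize", "256")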
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#messages","title":"Messages","text":""},{"location":"scheduler/DriverEndpoint/#killexecutorsonhost","title":"KillExecutorsOnHost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CoarseGrainedSchedulerBackend is requested to kill all executors on a node

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#killtask","title":"KillTask

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CoarseGrainedSchedulerBackend is requested to kill a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                KillTask(\n  taskId: Long,\n  executor: String,\n  interruptThread: Boolean)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                KillTask is sent when CoarseGrainedSchedulerBackend kills a task.

When KillTask is received, DriverEndpoint looks up the executor (in the executorDataMap registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If found, DriverEndpoint passes the message on to the executor (using its registered RPC endpoint for CoarseGrainedExecutorBackend).

Otherwise, you should see the following WARN message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Attempted to kill task [taskId] for unknown executor [executor].\n
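
A condensed sketch of that handling (with a simplified executor-endpoint type standing in for the registered RPC endpoint) could be:

// Simplified stand-ins for the real Spark types
case class KillTask(taskId: Long, executor: String, interruptThread: Boolean)
trait ExecutorEndpoint { def send(message: Any): Unit }

// Forward KillTask to the executor's RPC endpoint if it is registered,
// otherwise warn about the unknown executor.
def handleKillTask(
    msg: KillTask,
    executorDataMap: Map[String, ExecutorEndpoint],
    logWarning: String => Unit): Unit =
  executorDataMap.get(msg.executor) match {
    case Some(endpoint) => endpoint.send(msg)
    case None =>
      logWarning(s"Attempted to kill task ${msg.taskId} for unknown executor ${msg.executor}.")
  }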
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#launchedexecutor","title":"LaunchedExecutor","text":""},{"location":"scheduler/DriverEndpoint/#registerexecutor","title":"RegisterExecutor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CoarseGrainedExecutorBackend registers with the driver

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RegisterExecutor(\n  executorId: String,\n  executorRef: RpcEndpointRef,\n  hostname: String,\n  cores: Int,\n  logUrls: Map[String, String])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                RegisterExecutor is sent when CoarseGrainedExecutorBackend RPC Endpoint is requested to start.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When received, DriverEndpoint makes sure that no other executors were registered under the input executorId and that the input hostname is not blacklisted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If the requirements hold, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Registered executor [executorRef] ([address]) with ID [executorId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint does the bookkeeping:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Registers executorId (in addressToExecutorId)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Adds cores (in totalCoreCount)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Increments totalRegisteredExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Creates and registers ExecutorData for executorId (in executorDataMap)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Updates currentExecutorIdCounter if the input executorId is greater than the current value.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If numPendingExecutors is greater than 0, you should see the following DEBUG message in the logs and DriverEndpoint decrements numPendingExecutors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Decremented number of pending executors ([numPendingExecutors] left)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint sends RegisteredExecutor message back (that is to confirm that the executor was registered successfully).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint replies true (to acknowledge the message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint then announces the new executor by posting SparkListenerExecutorAdded to LiveListenerBus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In the end, DriverEndpoint makes executor resource offers (for launching tasks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If however there was already another executor registered under the input executorId, DriverEndpoint sends RegisterExecutorFailed message back with the reason:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Duplicate executor ID: [executorId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If however the input hostname is blacklisted, you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Rejecting [executorId] as it has been blacklisted.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DriverEndpoint sends RegisterExecutorFailed message back with the reason:

Executor is blacklisted: [executorId]
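The registration checks above can be summarized with the following sketch (simplified Scala, not Spark's actual DriverEndpoint code; executorDataMap, the blacklist and RegisterExecutorFailed are reduced to plain stand-ins):

// Simplified sketch of the RegisterExecutor validation described above.
// All types here are illustrative stand-ins, not Spark's actual classes.
case class RegisterExecutorFailed(reason: String)

object RegisterExecutorSketch {
  // executor ID -> hostname (a stand-in for the real executorDataMap entries)
  val executorDataMap = scala.collection.mutable.Map.empty[String, String]
  val blacklistedHosts = Set("bad-host")   // hypothetical blacklist

  def register(executorId: String, hostname: String): Either[RegisterExecutorFailed, Unit] =
    if (executorDataMap.contains(executorId))
      Left(RegisterExecutorFailed(s"Duplicate executor ID: $executorId"))
    else if (blacklistedHosts(hostname)) {
      println(s"Rejecting $executorId as it has been blacklisted.")
      Left(RegisterExecutorFailed(s"Executor is blacklisted: $executorId"))
    } else {
      executorDataMap(executorId) = hostname
      Right(())   // registration succeeded; resource offers follow
    }
}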
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#removeexecutor","title":"RemoveExecutor","text":""},{"location":"scheduler/DriverEndpoint/#removeworker","title":"RemoveWorker","text":""},{"location":"scheduler/DriverEndpoint/#retrievesparkappconfig","title":"RetrieveSparkAppConfig
RetrieveSparkAppConfig(
  resourceProfileId: Int)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedExecutorBackend standalone application is started

When received, DriverEndpoint replies with a SparkAppConfig message with the following (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. spark-prefixed configuration properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. IO Encryption Key
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3. Delegation tokens
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                4. Default profile
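The reply payload can be pictured as a simple record of the four items above. The field names and types below are assumptions for illustration only and do not claim to match Spark's actual SparkAppConfig definition:

// Illustrative shape of the SparkAppConfig reply (items 1-4 above).
case class SparkAppConfigSketch(
  sparkProperties: Seq[(String, String)],   // spark-prefixed configuration properties
  ioEncryptionKey: Option[Array[Byte]],     // IO encryption key, if enabled
  delegationTokens: Option[Array[Byte]],    // serialized delegation tokens, if any
  defaultProfile: String)                   // default (resource) profile, reduced to a name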
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#reviveoffers","title":"ReviveOffers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Periodically (every spark.scheduler.revive.interval) right after DriverEndpoint is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedSchedulerBackend is requested to revive resource offers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When received, DriverEndpoint makes executor resource offers.
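The periodic part can be sketched with a plain scheduled executor (the timer mechanism below is an assumption for illustration; only the ReviveOffers message and the spark.scheduler.revive.interval setting come from this page):

import java.util.concurrent.{Executors, TimeUnit}

object ReviveOffersSketch {
  case object ReviveOffers   // stand-in for the real message

  // Post ReviveOffers to the endpoint every reviveIntervalMs milliseconds.
  def start(reviveIntervalMs: Long, send: Any => Unit): Unit = {
    val timer = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable { def run(): Unit = send(ReviveOffers) }
    timer.scheduleAtFixedRate(task, reviveIntervalMs, reviveIntervalMs, TimeUnit.MILLISECONDS)
  }
}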

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#statusupdate","title":"StatusUpdate


StatusUpdate(
  executorId: String,
  taskId: Long,
  state: TaskState,
  data: SerializableBuffer)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                StatusUpdate is sent when CoarseGrainedExecutorBackend sends task status updates to the driver.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When StatusUpdate is received, DriverEndpoint requests the TaskSchedulerImpl to handle the task status update.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If the task has finished, DriverEndpoint updates the number of cores available for work on the corresponding executor (registered in executorDataMap).

DriverEndpoint then makes an executor resource offer for that single executor.

When DriverEndpoint finds no such executor (in executorDataMap), you should see the following WARN message in the logs:

Ignored task status update ([taskId] state [state]) from unknown executor with ID [executorId]
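A minimal sketch of the flow above (the bookkeeping is simplified; the ExecutorData shape, the finished-state check and the commented-out calls are assumptions, not Spark's actual implementation):

object StatusUpdateSketch {
  final case class ExecutorData(var freeCores: Int, coresPerTask: Int = 1)
  val executorDataMap = scala.collection.mutable.Map.empty[String, ExecutorData]

  private def isFinished(state: String): Boolean =
    Set("FINISHED", "FAILED", "KILLED", "LOST").contains(state)

  def handleStatusUpdate(executorId: String, taskId: Long, state: String): Unit = {
    // 1. Let the task scheduler record the status update (stubbed out here)
    // taskScheduler.statusUpdate(taskId, state, data)
    if (isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(data) =>
          data.freeCores += data.coresPerTask   // the finished task gives its cores back
          // makeOffers(executorId)             // offer resources on this single executor
        case None =>
          println(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
  }
}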
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#stopdriver","title":"StopDriver","text":""},{"location":"scheduler/DriverEndpoint/#stopexecutors","title":"StopExecutors

StopExecutors is a receive-and-reply (blocking) message. When received, the following INFO message appears in the logs:

Asking each executor to shut down

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                It then sends a StopExecutor message to every registered executor (from executorDataMap).
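Boiled down, the handler logs the message and broadcasts a stop request (a sketch; the per-executor send function is a stand-in for the real executor endpoints):

object StopExecutorsSketch {
  case object StopExecutor

  def stopExecutors(executorDataMap: Map[String, Any => Unit]): Unit = {
    println("Asking each executor to shut down")
    executorDataMap.values.foreach(send => send(StopExecutor))
  }
}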

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/DriverEndpoint/#updatedelegationtokens","title":"UpdateDelegationTokens","text":""},{"location":"scheduler/DriverEndpoint/#removing-executor","title":"Removing Executor
removeExecutor(
  executorId: String,
  reason: ExecutorLossReason): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When removeExecutor is executed, you should see the following DEBUG message in the logs:

Asked to remove executor [executorId] with reason [reason]

removeExecutor then looks up the executor (by executorId) in the executorDataMap internal registry.

If the executor was found, removeExecutor removes it from the following registries:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • addressToExecutorId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • executorDataMap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • executorsPendingToRemove
removeExecutor decrements:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • totalCoreCount by the executor's totalCores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • totalRegisteredExecutors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, removeExecutor notifies TaskSchedulerImpl that an executor was lost.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  removeExecutor posts SparkListenerExecutorRemoved to LiveListenerBus (with the executorId executor).

If, however, the executor could not be found, removeExecutor requests the BlockManagerMaster to remove the executor asynchronously.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  removeExecutor uses SparkEnv to access the current BlockManager and then BlockManagerMaster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  You should see the following INFO message in the logs:

Asked to remove non-existent executor [executorId]

removeExecutor is used when DriverEndpoint handles a RemoveExecutor message and when it gets disassociated from the remote RPC endpoint of an executor.
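The whole removeExecutor flow can be sketched as follows (registries and counters reduced to plain collections; the commented-out scheduler, listener-bus and BlockManagerMaster calls mark the notifications described above):

object RemoveExecutorSketch {
  import java.util.concurrent.atomic.AtomicInteger
  final case class ExecutorData(address: String, totalCores: Int)

  val executorDataMap          = scala.collection.mutable.Map.empty[String, ExecutorData]
  val addressToExecutorId      = scala.collection.mutable.Map.empty[String, String]
  val executorsPendingToRemove = scala.collection.mutable.Set.empty[String]
  val totalCoreCount           = new AtomicInteger(0)
  val totalRegisteredExecutors = new AtomicInteger(0)

  def removeExecutor(executorId: String, reason: String): Unit = {
    println(s"Asked to remove executor $executorId with reason $reason")
    executorDataMap.remove(executorId) match {
      case Some(data) =>
        addressToExecutorId.remove(data.address)
        executorsPendingToRemove.remove(executorId)
        totalCoreCount.addAndGet(-data.totalCores)
        totalRegisteredExecutors.decrementAndGet()
        // taskScheduler.executorLost(executorId, reason)
        // listenerBus.post(SparkListenerExecutorRemoved(...))
      case None =>
        // blockManagerMaster.removeExecutorAsync(executorId)
        println(s"Asked to remove non-existent executor $executorId")
    }
  }
}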

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#removing-worker","title":"Removing Worker
removeWorker(
  workerId: String,
  host: String,
  message: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  removeWorker prints out the following DEBUG message to the logs:

Asked to remove worker [workerId] with reason [message]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, removeWorker simply requests the TaskSchedulerImpl to workerRemoved.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  removeWorker is used when DriverEndpoint is requested to handle a RemoveWorker event.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#processing-one-way-messages","title":"Processing One-Way Messages
receive: PartialFunction[Any, Unit]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  receive is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  receive...FIXME
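A hedged sketch of what a one-way (fire-and-forget) handler can look like; the message names follow this page, but the exact set of cases and their bodies are assumptions, not Spark's actual receive implementation:

object ReceiveSketch {
  case object ReviveOffers
  final case class StatusUpdate(executorId: String, taskId: Long, state: String)
  final case class RemoveExecutor(executorId: String, reason: String)

  val receive: PartialFunction[Any, Unit] = {
    case ReviveOffers          => ()   // make executor resource offers
    case StatusUpdate(_, _, _) => ()   // handle the task status update
    case RemoveExecutor(_, _)  => ()   // remove the executor
  }
}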

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#processing-two-way-messages","title":"Processing Two-Way Messages
receiveAndReply(
  context: RpcCallContext): PartialFunction[Any, Unit]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  receiveAndReply is part of the RpcEndpoint abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  receiveAndReply...FIXME
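For comparison with receive, a sketch of a two-way handler where the RpcCallContext is reduced to a plain reply callback; the message and reply names follow this page, everything else is illustrative:

object ReceiveAndReplySketch {
  final case class RetrieveSparkAppConfig(resourceProfileId: Int)
  final case class SparkAppConfig(sparkProperties: Seq[(String, String)])
  case object StopExecutors

  def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit] = {
    case RetrieveSparkAppConfig(_) =>
      reply(SparkAppConfig(Seq("spark.app.name" -> "sketch")))
    case StopExecutors =>
      reply(true)   // acknowledge after asking executors to shut down
  }
}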

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#ondisconnected-callback","title":"onDisconnected Callback

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  onDisconnected removes the worker from the internal addressToExecutorId registry (that effectively removes the worker from a cluster).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  onDisconnected removes the executor with the reason being SlaveLost and message:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
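In sketch form (the registry is a plain map and removeExecutor is passed in as a function; the reason string is the message above, standing in for a SlaveLost reason):

object OnDisconnectedSketch {
  val addressToExecutorId = scala.collection.mutable.Map.empty[String, String]

  def onDisconnected(remoteAddress: String,
                     removeExecutor: (String, String) => Unit): Unit =
    addressToExecutorId.remove(remoteAddress).foreach { executorId =>
      removeExecutor(executorId,
        "Remote RPC client disassociated. Likely due to containers exceeding " +
          "thresholds, or network issues. Check driver logs for WARN messages.")
    }
}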
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#executors-by-rpcaddress-registry","title":"Executors by RpcAddress Registry
addressToExecutorId: Map[RpcAddress, String]

Executor IDs by their RPC addresses (host and port).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Set when an executor connects to register itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#disabling-executor","title":"Disabling Executor
disableExecutor(
  executorId: String): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  disableExecutor checks whether the executor is active:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • If so, disableExecutor adds the executor to the executorsPendingLossReason registry
• Otherwise, disableExecutor checks whether the executor is already in the executorsPendingToRemove registry

disableExecutor then determines whether the executor should really be disabled (i.e. it was active or is registered in the executorsPendingToRemove registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If the executor should be disabled, disableExecutor prints out the following INFO message to the logs and notifies the TaskSchedulerImpl that the executor is lost.

Disabling executor [executorId].

disableExecutor returns whether the executor should have been disabled or not.
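The decision can be sketched as follows (registries reduced to plain sets; the scheduler notification is left as a comment; not Spark's actual fields or code):

object DisableExecutorSketch {
  val activeExecutors            = scala.collection.mutable.Set.empty[String]
  val executorsPendingLossReason = scala.collection.mutable.Set.empty[String]
  val executorsPendingToRemove   = scala.collection.mutable.Set.empty[String]

  def disableExecutor(executorId: String): Boolean = {
    val shouldDisable =
      if (activeExecutors.contains(executorId)) {
        executorsPendingLossReason += executorId
        true
      } else {
        // not active: it still counts if it is already pending removal
        executorsPendingToRemove.contains(executorId)
      }
    if (shouldDisable) {
      println(s"Disabling executor $executorId.")
      // taskScheduler.executorLost(executorId, ...)
    }
    shouldDisable
  }
}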

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  disableExecutor is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • KubernetesDriverEndpoint is requested to handle onDisconnected event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • YarnDriverEndpoint is requested to handle onDisconnected event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/DriverEndpoint/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint logger to see what happens inside.

Add the following line to conf/log4j.properties:

    log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverEndpoint=ALL

Refer to Logging.

# ExecutorData

ExecutorData is the metadata of an executor:

• Executor's RPC Endpoint
• Executor's RpcAddress
• Executor's Host
• Executor's Free Cores
• Executor's Total Cores
• Executor's Log URLs (Map[String, String])
• Executor's Attributes (Map[String, String])
• Executor's Resources Info (Map[String, ExecutorResourceInfo])
• Executor's ResourceProfile ID

ExecutorData is created for every executor that registers (when DriverEndpoint is requested to handle a RegisterExecutor message).

ExecutorData is used by CoarseGrainedSchedulerBackend to track registered executors.

Note

ExecutorData is posted as part of the SparkListenerExecutorAdded event by DriverEndpoint every time an executor is registered.
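
For illustration, here is a minimal, self-contained sketch of such a metadata holder. It mirrors the fields listed above but is not the actual Spark class; plain types stand in for Spark's internal RpcEndpointRef, RpcAddress and ExecutorResourceInfo types, and the names are assumptions for this example only.

```scala
// Simplified, illustrative stand-in for executor metadata (not the real ExecutorData).
final case class ExecutorDataSketch(
  executorEndpoint: String,              // stand-in for the executor's RpcEndpointRef
  executorAddress: String,               // stand-in for the executor's RpcAddress
  executorHost: String,                  // executor's host
  freeCores: Int,                        // cores currently available for scheduling
  totalCores: Int,                       // total cores of the executor
  logUrlMap: Map[String, String],        // executor's log URLs
  attributes: Map[String, String],       // executor's attributes
  resourcesInfo: Map[String, String],    // stand-in for Map[String, ExecutorResourceInfo]
  resourceProfileId: Int)                // executor's ResourceProfile ID
```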

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/ExternalClusterManager/","title":"ExternalClusterManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExternalClusterManager is an abstraction of pluggable cluster managers that can create a SchedulerBackend and TaskScheduler for a given master URL (when SparkContext is created).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The support for pluggable cluster managers was introduced in SPARK-13904 Add support for pluggable cluster manager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExternalClusterManager can be registered using the java.util.ServiceLoader mechanism (with service markers under META-INF/services directory).
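
The snippet below sketches the general java.util.ServiceLoader mechanism with a self-contained, illustrative provider interface. ClusterManagerProvider and findProvider are made-up names for this example; the real service interface is org.apache.spark.scheduler.ExternalClusterManager.

```scala
import java.util.ServiceLoader

// Illustrative provider interface (a stand-in for the real ExternalClusterManager trait).
trait ClusterManagerProvider {
  def canCreate(masterURL: String): Boolean
}

object ClusterManagerDiscovery {
  // Any jar that ships a META-INF/services/ClusterManagerProvider file (named after the
  // fully-qualified interface name) and lists its implementation class is discovered here.
  def findProvider(masterURL: String): Option[ClusterManagerProvider] = {
    val it = ServiceLoader.load(classOf[ClusterManagerProvider]).iterator()
    var found: Option[ClusterManagerProvider] = None
    while (it.hasNext && found.isEmpty) {
      val candidate = it.next()
      if (candidate.canCreate(masterURL)) found = Some(candidate)
    }
    found
  }
}
```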

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/ExternalClusterManager/#contract","title":"Contract","text":""},{"location":"scheduler/ExternalClusterManager/#checking-support-for-master-url","title":"Checking Support for Master URL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    canCreate(\n  masterURL: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Checks whether this cluster manager instance can create scheduler components for a given master URL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when SparkContext is created (and requested for a cluster manager)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ExternalClusterManager/#creating-schedulerbackend","title":"Creating SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createSchedulerBackend(\n  sc: SparkContext,\n  masterURL: String,\n  scheduler: TaskScheduler): SchedulerBackend\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Creates a SchedulerBackend for a given SparkContext, master URL, and TaskScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ExternalClusterManager/#creating-taskscheduler","title":"Creating TaskScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createTaskScheduler(\n  sc: SparkContext,\n  masterURL: String): TaskScheduler\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Creates a TaskScheduler for a given SparkContext and master URL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ExternalClusterManager/#initializing-scheduling-components","title":"Initializing Scheduling Components
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    initialize(\n  scheduler: TaskScheduler,\n  backend: SchedulerBackend): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Initializes the TaskScheduler and SchedulerBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when SparkContext is created (and requested for a SchedulerBackend and TaskScheduler)
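
To make the contract concrete, here is a skeletal sketch of a custom cluster manager for a hypothetical mycluster:// master URL. The method signatures follow the contract above, but the class name, the URL scheme, and the bodies are illustrative assumptions, not Spark code; since the trait is private[spark], a real implementation is compiled as part of Spark's org.apache.spark.scheduler package (as the built-in managers are) and registered via a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file listing the implementation class.

```scala
package org.apache.spark.scheduler

import org.apache.spark.SparkContext

// Hypothetical skeleton of an ExternalClusterManager (for illustration only).
private[spark] class MyClusterManager extends ExternalClusterManager {

  // Claim only the master URLs this manager understands
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  // Reuse TaskSchedulerImpl, as the built-in cluster managers do
  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  // A real manager would return a SchedulerBackend that talks to the cluster manager
  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend = ???

  // Wire the TaskScheduler and SchedulerBackend together once both exist
  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
```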

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/ExternalClusterManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • KubernetesClusterManager (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MesosClusterManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • YarnClusterManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/FIFOSchedulableBuilder/","title":"FIFOSchedulableBuilder","text":"

FIFOSchedulableBuilder is a SchedulableBuilder for FIFO scheduling mode that holds a single spark-scheduler-Pool.md[Pool] (that is given when FIFOSchedulableBuilder is created).

NOTE: FIFOSchedulableBuilder is the scheduler:TaskSchedulerImpl.md#creating-instance[default SchedulableBuilder for TaskSchedulerImpl].

NOTE: When FIFOSchedulableBuilder is created, the TaskSchedulerImpl passes its own rootPool (a part of the scheduler:TaskScheduler.md#contract[TaskScheduler Contract]).

FIFOSchedulableBuilder obeys the SchedulableBuilder contract as follows:

• buildPools does nothing.
• addTaskSetManager spark-scheduler-Pool.md#addSchedulable[passes the input Schedulable to the one and only rootPool Pool (using addSchedulable)] and completely disregards the properties of the Schedulable (see the sketch below).
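
Below is a minimal, self-contained sketch of that behaviour. It uses simplified stand-ins for Spark's scheduling types; FifoBuilderSketch, Pool and Schedulable here are illustrative names, not the actual Spark classes.

```scala
import java.util.Properties
import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins for Spark's scheduling types, for illustration only.
trait Schedulable
final class Pool(val name: String) {
  private val schedulables = ArrayBuffer.empty[Schedulable]
  def addSchedulable(schedulable: Schedulable): Unit = schedulables += schedulable
}

// FIFO-style builder: nothing to build, and every TaskSetManager goes straight
// into the single root pool; the properties are disregarded.
class FifoBuilderSketch(rootPool: Pool) {
  def buildPools(): Unit = ()
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit =
    rootPool.addSchedulable(manager)
}
```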
=== [[creating-instance]] Creating FIFOSchedulableBuilder Instance

FIFOSchedulableBuilder takes the following when created:

• [[rootPool]] rootPool spark-scheduler-Pool.md[Pool]

# FairSchedulableBuilder

FairSchedulableBuilder is a SchedulableBuilder that is created exclusively for scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] for FAIR scheduling mode (when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FAIR).

[[creating-instance]] FairSchedulableBuilder takes the following to be created:

• [[rootPool]] Root Pool
• [[conf]] SparkConf.md[SparkConf]

Once created, TaskSchedulerImpl requests the FairSchedulableBuilder to buildPools.

[[DEFAULT_SCHEDULER_FILE]] FairSchedulableBuilder uses the pools defined in an allocation pools configuration file, which is assumed to be the value of the configuration-properties.md#spark.scheduler.allocation.file[spark.scheduler.allocation.file] configuration property or the default fairscheduler.xml.

TIP: Use conf/fairscheduler.xml.template as a template for the allocation pools configuration file.

[[DEFAULT_POOL_NAME]] FairSchedulableBuilder always has the default pool defined (and registers it unless it is already defined in the allocation pools configuration file).

[[FAIR_SCHEDULER_PROPERTIES]] [[spark.scheduler.pool]] FairSchedulableBuilder uses the spark.scheduler.pool local property for the name of the pool to use when requested to addTaskSetManager (default: default).

Note

SparkContext.setLocalProperty lets you set local properties per thread to group jobs in logical groups, e.g. to allow FairSchedulableBuilder to use the spark.scheduler.pool property and to group jobs from different threads to be submitted for execution on a non-default pool.

[source, scala]
----
scala> :type sc
org.apache.spark.SparkContext

sc.setLocalProperty("spark.scheduler.pool", "production")

// whatever is executed afterwards is submitted to production pool
----

[[logging]]
[TIP]
====
Enable ALL logging level for org.apache.spark.scheduler.FairSchedulableBuilder logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.FairSchedulableBuilder=ALL

Refer to Logging.
====

=== [[allocations-file]] Allocation Pools Configuration File

The allocation pools configuration file is an XML file.

The default conf/fairscheduler.xml.template is as follows:

[source, xml]
----
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
----

TIP: The top-level element's name (allocations) can be anything. Spark does not insist on allocations and accepts any name.
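
For context, here is a short, hypothetical example of wiring this up in an application: it enables FAIR scheduling, points Spark at a custom allocations file, and selects the production pool for subsequent jobs. The application name, file path, and master URL are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable FAIR scheduling and point Spark at a custom allocation pools file
// (the path and app name below are placeholders).
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("fair-pools-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val sc = SparkContext.getOrCreate(conf)

// Jobs submitted from this thread go to the production pool
sc.setLocalProperty("spark.scheduler.pool", "production")
```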

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[buildPools]] Building (Tree of) Pools of Schedulables -- buildPools Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/FairSchedulableBuilder/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/FairSchedulableBuilder/#buildpools-unit","title":"buildPools(): Unit","text":"

NOTE: buildPools is part of the SchedulableBuilder contract to build a tree of Pools (Schedulables).

buildPools builds pools from the allocations configuration file, if available (buildFairSchedulerPool), and then registers the default pool (buildDefaultPool).

buildPools prints out the following INFO message to the logs when the configuration file (per the spark.scheduler.allocation.file configuration property) could be read:

Creating Fair Scheduler pools from [file]

buildPools prints out the following INFO message to the logs when the spark.scheduler.allocation.file configuration property was not used to define the configuration file and the default configuration file (fairscheduler.xml) is used instead:

Creating Fair Scheduler pools from default file: [DEFAULT_SCHEDULER_FILE]

When neither the spark.scheduler.allocation.file configuration property nor the default configuration file could be used, buildPools prints out the following WARN message to the logs:

Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in [DEFAULT_SCHEDULER_FILE] or set spark.scheduler.allocation.file to a file that contains the configuration.
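A simplified, hypothetical sketch of the buildPools flow described above (not the actual Spark source; logging and error handling omitted):

[source, scala]
----
// Hypothetical sketch: read the allocations file (spark.scheduler.allocation.file
// or the default fairscheduler.xml on the classpath, if any), build the configured
// pools, and always register the default pool afterwards.
def buildPools(): Unit = {
  val fileOpt = conf.getOption("spark.scheduler.allocation.file")
  val streamOpt: Option[java.io.InputStream] = fileOpt
    .map(file => new java.io.FileInputStream(file): java.io.InputStream)
    .orElse(Option(getClass.getClassLoader.getResourceAsStream("fairscheduler.xml")))

  streamOpt.foreach { is =>
    try buildFairSchedulerPool(is, fileOpt.getOrElse("fairscheduler.xml"))
    finally is.close()
  }

  buildDefaultPool()
}
----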

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[addTaskSetManager]] addTaskSetManager Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/FairSchedulableBuilder/#source-scala_2","title":"[source, scala]","text":""},{"location":"scheduler/FairSchedulableBuilder/#addtasksetmanagermanager-schedulable-properties-properties-unit","title":"addTaskSetManager(manager: Schedulable, properties: Properties): Unit","text":"

NOTE: addTaskSetManager is part of the SchedulableBuilder contract to register a new Schedulable (TaskSetManager) with the rootPool.

addTaskSetManager finds the pool by name (in the given Properties) under the spark.scheduler.pool property, or defaults to the default pool if undefined.

addTaskSetManager then requests the rootPool to find the Schedulable (pool) with that name.

Unless found, addTaskSetManager creates a new Pool with the default pool configuration (as if the default pool were used) and requests the rootPool to register it. addTaskSetManager then prints out the following WARN message to the logs:

A job was submitted with scheduler pool [poolName], which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain [poolName]. Created [poolName] with default configuration (schedulingMode: [mode], minShare: [minShare], weight: [weight])

addTaskSetManager then requests the pool (found or newly created) to register the given Schedulable (the TaskSetManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, addTaskSetManager prints out the following INFO message to the logs:

Added task set [name] tasks to pool [poolName]
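For reference, application code routes jobs to a pool with the spark.scheduler.pool local property that addTaskSetManager reads (the pool name production is illustrative):

[source, scala]
----
// Route all jobs submitted from this thread to the "production" pool
// (illustrative name; it should be defined in the allocations file,
// otherwise a pool with default settings is created as described above).
sc.setLocalProperty("spark.scheduler.pool", "production")

sc.parallelize(1 to 100).count()

// Back to the default pool
sc.setLocalProperty("spark.scheduler.pool", null)
----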

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[buildDefaultPool]] Registering Default Pool -- buildDefaultPool Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/FairSchedulableBuilder/#source-scala_3","title":"[source, scala]","text":""},{"location":"scheduler/FairSchedulableBuilder/#builddefaultpool-unit","title":"buildDefaultPool(): Unit","text":"

buildDefaultPool requests the rootPool to find the default pool (the one with the default name).

Unless already available, buildDefaultPool creates a Pool with the following:

• default pool name

• FIFO scheduling mode

• 0 for the initial minimum share

• 1 for the initial weight

In the end, buildDefaultPool requests the rootPool to register the new pool and prints out the following INFO message to the logs:

Created default pool: [name], schedulingMode: [mode], minShare: [minShare], weight: [weight]

NOTE: buildDefaultPool is used exclusively when FairSchedulableBuilder is requested to buildPools.
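A simplified, hypothetical sketch of buildDefaultPool based on the defaults listed above (not the actual Spark source):

[source, scala]
----
// Hypothetical sketch: register the "default" pool (FIFO, minShare 0, weight 1)
// under the root pool unless a pool with that name already exists.
def buildDefaultPool(): Unit = {
  if (rootPool.getSchedulableByName("default") == null) {
    val pool = new Pool("default", SchedulingMode.FIFO, 0, 1)
    rootPool.addSchedulable(pool)
    logInfo("Created default pool: default, schedulingMode: FIFO, minShare: 0, weight: 1")
  }
}
----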

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[buildFairSchedulerPool]] Building Pools from XML Allocations File -- buildFairSchedulerPool Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/FairSchedulableBuilder/#source-scala_4","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildFairSchedulerPool( is: InputStream, fileName: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildFairSchedulerPool starts by loading the XML file from the given InputStream.

For every pool element, buildFairSchedulerPool creates a Pool with the following:

• Pool name per name attribute

• Scheduling mode per schedulingMode element (case-insensitive with FIFO as the default)

• Initial minimum share per minShare element (default: 0)

• Initial weight per weight element (default: 1)

In the end, buildFairSchedulerPool requests the rootPool to register the new pool and prints out the following INFO message to the logs:

Created pool: [name], schedulingMode: [mode], minShare: [minShare], weight: [weight]
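A simplified, hypothetical sketch of the per-pool parsing (not the actual Spark source; it assumes scala-xml and the element and attribute names listed above):

[source, scala]
----
import java.io.InputStream
import scala.xml.XML

// Hypothetical sketch: create and register a Pool for every <pool> element,
// falling back to FIFO / 0 / 1 when schedulingMode, minShare or weight are missing.
def buildFairSchedulerPool(is: InputStream, fileName: String): Unit = {
  val xml = XML.load(is)
  (xml \\ "pool").foreach { poolNode =>
    val poolName = (poolNode \ "@name").text
    val mode = (poolNode \ "schedulingMode").text.trim.toUpperCase match {
      case "FAIR" => SchedulingMode.FAIR
      case _      => SchedulingMode.FIFO
    }
    val minShare = Some((poolNode \ "minShare").text.trim).filter(_.nonEmpty).map(_.toInt).getOrElse(0)
    val weight   = Some((poolNode \ "weight").text.trim).filter(_.nonEmpty).map(_.toInt).getOrElse(1)

    rootPool.addSchedulable(new Pool(poolName, mode, minShare, weight))
    logInfo(s"Created pool: $poolName, schedulingMode: $mode, minShare: $minShare, weight: $weight")
  }
}
----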

NOTE: buildFairSchedulerPool is used exclusively when FairSchedulableBuilder is requested to buildPools.

== HighlyCompressedMapStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          HighlyCompressedMapStatus is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/JobListener/","title":"JobListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          JobListener is an abstraction of listeners that listen for job completion or failure events (after submitting a job to the DAGScheduler).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/JobListener/#contract","title":"Contract","text":""},{"location":"scheduler/JobListener/#tasksucceeded","title":"taskSucceeded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          taskSucceeded(\n  index: Int,\n  result: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when DAGScheduler is requested to handleTaskCompletion or markMapStageJobAsFinished

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/JobListener/#jobfailed","title":"jobFailed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          jobFailed(\n  exception: Exception): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when DAGScheduler is requested to cleanUpAfterSchedulerStop, handleJobSubmitted, handleMapStageSubmitted, handleTaskCompletion or failJobAndIndependentStages

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/JobListener/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ApproximateActionListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JobWaiter
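For illustration only, a minimal sketch of the contract and a listener that implements it (the trait is internal to Spark; CountingListener is a made-up example):

[source, scala]
----
// The JobListener contract, as described above (simplified; the real trait
// lives in org.apache.spark.scheduler and is internal to Spark).
trait JobListener {
  def taskSucceeded(index: Int, result: Any): Unit
  def jobFailed(exception: Exception): Unit
}

// Made-up listener that reports when all tasks of a job have finished.
class CountingListener(totalTasks: Int) extends JobListener {
  private var finished = 0
  override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    finished += 1
    if (finished == totalTasks) println("All tasks finished successfully")
  }
  override def jobFailed(exception: Exception): Unit =
    println(s"Job failed: ${exception.getMessage}")
}
----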
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/JobWaiter/","title":"JobWaiter","text":"

JobWaiter is a JobListener that listens for task events and knows when all of a job's tasks have finished, successfully or not.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/JobWaiter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          JobWaiter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DAGScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Total number of tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Result Handler Function ((Int, T) => Unit)

JobWaiter is created when DAGScheduler is requested to submit a job or a map stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/JobWaiter/#scala-promise","title":"Scala Promise
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            jobPromise: Promise[Unit]\n

jobPromise is a Scala Promise that is completed successfully when all tasks have finished, or completed with an exception when the job fails.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/JobWaiter/#tasksucceeded","title":"taskSucceeded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            taskSucceeded(\n  index: Int,\n  result: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            taskSucceeded executes the Result Handler Function with the given index and result.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            taskSucceeded marks the waiter finished successfully when all tasks have finished.

taskSucceeded is part of the JobListener abstraction.
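A simplified, hypothetical sketch of how taskSucceeded ties the result handler and jobPromise together (not the actual Spark source; synchronization simplified):

[source, scala]
----
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Promise

// Hypothetical JobWaiter-like class: run the result handler for every finished
// task and complete jobPromise once all tasks are accounted for.
class SimpleJobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private val finishedTasks = new AtomicInteger(0)
  val jobPromise: Promise[Unit] =
    if (totalTasks == 0) Promise.successful(()) else Promise[Unit]()

  def taskSucceeded(index: Int, result: Any): Unit = {
    resultHandler(index, result.asInstanceOf[T])
    if (finishedTasks.incrementAndGet() == totalTasks) {
      jobPromise.success(())    // all tasks finished: the waiter succeeded
    }
  }

  def jobFailed(exception: Exception): Unit = {
    jobPromise.tryFailure(exception)   // the waiter failed
  }
}
----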

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/JobWaiter/#jobfailed","title":"jobFailed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            jobFailed(\n  exception: Exception): Unit\n

jobFailed marks the waiter as failed.

jobFailed is part of the JobListener abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/LiveListenerBus/","title":"LiveListenerBus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            LiveListenerBus is an event bus to dispatch Spark events to registered SparkListeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            LiveListenerBus is a single-JVM SparkListenerBus that uses listenerThread to poll events.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

The event queue is a java.util.concurrent.LinkedBlockingQueue with a capacity of 10000 SparkListenerEvent events.
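The capacity is configurable; a minimal sketch of raising it via the spark.scheduler.listenerbus.eventqueue.capacity configuration property (the value 20000 is illustrative):

[source, scala]
----
import org.apache.spark.SparkConf

// Raise the listener bus event queue capacity above the default of 10000
// (20000 is an illustrative value); events are dropped once a queue is full.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
----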

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/LiveListenerBus/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            LiveListenerBus takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LiveListenerBus is created (and started) when SparkContext is requested to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/LiveListenerBus/#event-queues","title":"Event Queues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              queues: CopyOnWriteArrayList[AsyncEventQueue]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LiveListenerBus manages AsyncEventQueues.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              queues is initialized empty when LiveListenerBus is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              queues is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Registering Listener with Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Posting Event to All Queues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Deregistering Listener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Starting LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#livelistenerbusmetrics","title":"LiveListenerBusMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              metrics: LiveListenerBusMetrics\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              LiveListenerBus creates a LiveListenerBusMetrics when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              metrics is registered (with a MetricsSystem) when LiveListenerBus is started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              metrics is used to:

• Increment the number of posted events on every event posting (see the counter sketch below)
• Create an AsyncEventQueue when adding a listener to a queue
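LiveListenerBusMetrics is built on the Dropwizard (Codahale) metrics library that Spark's MetricsSystem uses. A minimal sketch of the counter pattern follows; the registry path shown is an assumption, not necessarily the exact Spark metric name.

import com.codahale.metrics.{Counter, MetricRegistry}

// Minimal stand-in for LiveListenerBusMetrics: one counter of posted events
val metricRegistry = new MetricRegistry
val numEventsPosted: Counter = metricRegistry.counter("numEventsPosted")

// What happens conceptually on every post(event)
numEventsPosted.inc()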
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#starting-livelistenerbus","title":"Starting LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              start(\n  sc: SparkContext,\n  metricsSystem: MetricsSystem): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              start starts AsyncEventQueues (from the queues internal registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, start requests the given MetricsSystem to register the LiveListenerBusMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              start is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#posting-event-to-all-queues","title":"Posting Event to All Queues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              post(\n  event: SparkListenerEvent): Unit\n

post puts the input event onto the internal eventQueue queue and releases the internal eventLock semaphore. If the event placement is not successful (which can happen since the queue is capped at 10000 events), the onDropEvent method is called.

Event publishing is only possible as long as the stopped flag is not enabled (i.e. the bus has not been stopped).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              post is used when...FIXME
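A simplified sketch of the posting contract described above (a toy class, not Spark's actual implementation): a bounded queue, a stopped flag, and a drop callback when the queue is full.

import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.scheduler.SparkListenerEvent

class ToyListenerBus(capacity: Int = 10000) {
  private val eventQueue = new LinkedBlockingQueue[SparkListenerEvent](capacity)
  private val stopped = new AtomicBoolean(false)

  def post(event: SparkListenerEvent): Unit = {
    if (stopped.get()) return                // no publishing once the bus is stopped
    val accepted = eventQueue.offer(event)   // non-blocking; false when the queue is full
    if (!accepted) onDropEvent(event)
  }

  private def onDropEvent(event: SparkListenerEvent): Unit =
    Console.err.println("Dropping event because no remaining room in event queue.")

  def stop(): Unit = stopped.set(true)
}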

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#posttoqueues","title":"postToQueues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              postToQueues(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              postToQueues...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#event-dropped-callback","title":"Event Dropped Callback
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onDropEvent(\n  event: SparkListenerEvent): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              onDropEvent is called when no further events can be added to the internal eventQueue queue (while posting a SparkListenerEvent event).

onDropEvent prints out the following ERROR message to the logs (and makes sure it is printed only once).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#stopping-livelistenerbus","title":"Stopping LiveListenerBus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop(): Unit\n

stop releases the internal eventLock semaphore and waits until the listenerThread dies, which can only happen after all posted events have been processed (and polling eventQueue gives nothing).

In the end, stop enables the stopped flag.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#listenerthread-for-event-polling","title":"listenerThread for Event Polling

LiveListenerBus uses a single daemon thread (SparkListenerBus) that polls events from the event queue, one event at a time, and only after the listener bus has been started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-status-queue","title":"Registering Listener with Status Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToStatusQueue(\n  listener: SparkListenerInterface): Unit\n

addToStatusQueue adds the given SparkListenerInterface to the appStatus queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToStatusQueue is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BarrierCoordinator is requested to onStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • HiveThriftServer2 utility is used to createListenerAndUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SharedState (Spark SQL) is requested to create a SQLAppStatusStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-shared-queue","title":"Registering Listener with Shared Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToSharedQueue(\n  listener: SparkListenerInterface): Unit\n

addToSharedQueue adds the given SparkListenerInterface to the shared queue (see the configuration sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToSharedQueue is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to register a SparkListener and register extra SparkListeners
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutionListenerBus (Spark Structured Streaming) is created
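For example, extra SparkListeners can be declared via the spark.extraListeners configuration property; SparkContext instantiates them during startup and registers them on the shared queue. The listener class name below is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("extra-listeners-demo")
  // comma-separated list of SparkListener class names (must be on the classpath)
  .set("spark.extraListeners", "com.example.JobStartLogger")
val sc = new SparkContext(conf)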
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-executormanagement-queue","title":"Registering Listener with executorManagement Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToManagementQueue(\n  listener: SparkListenerInterface): Unit\n

addToManagementQueue adds the given SparkListenerInterface to the executorManagement queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToManagementQueue is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExecutorAllocationManager is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • HeartbeatReceiver is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-eventlog-queue","title":"Registering Listener with eventLog Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToEventLogQueue(\n  listener: SparkListenerInterface): Unit\n

addToEventLogQueue adds the given SparkListenerInterface to the eventLog queue (see the configuration sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToEventLogQueue is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is created (with event logging enabled)
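As a reminder, event logging is controlled by the spark.eventLog.enabled and spark.eventLog.dir configuration properties; a minimal local setup might look as follows (the directory is an assumption and must already exist).

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("event-log-demo")
  .set("spark.eventLog.enabled", "true")          // adds the event-logging listener
  .set("spark.eventLog.dir", "/tmp/spark-events") // assumed, pre-created directory
val sc = new SparkContext(conf)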
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/LiveListenerBus/#registering-listener-with-queue","title":"Registering Listener with Queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToQueue(\n  listener: SparkListenerInterface,\n  queue: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToQueue finds the queue in the queues internal registry.

If found, addToQueue requests it to add the given listener.

If not found, addToQueue creates an AsyncEventQueue (with the given name, the LiveListenerBusMetrics, and this LiveListenerBus) and requests it to add the given listener. The AsyncEventQueue is started and added to the queues internal registry (a simplified sketch follows the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              addToQueue is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • LiveListenerBus is requested to addToSharedQueue, addToManagementQueue, addToStatusQueue, addToEventLogQueue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • StreamingQueryListenerBus (Spark Structured Streaming) is created
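A simplified sketch of this find-or-create behaviour, with a toy stand-in for AsyncEventQueue (not the actual Spark types or start semantics):

import java.util.concurrent.CopyOnWriteArrayList
import scala.jdk.CollectionConverters._  // Scala 2.13; use scala.collection.JavaConverters on 2.12
import org.apache.spark.scheduler.SparkListenerInterface

// Toy stand-in for AsyncEventQueue: just a name and a listener list
final class ToyQueue(val name: String) {
  private val listeners = new CopyOnWriteArrayList[SparkListenerInterface]()
  def addListener(listener: SparkListenerInterface): Unit = listeners.add(listener)
  def start(): Unit = ()  // the real AsyncEventQueue spawns its dispatch thread here
}

object ToyBus {
  private val queues = new CopyOnWriteArrayList[ToyQueue]()

  def addToQueue(listener: SparkListenerInterface, queue: String): Unit =
    queues.asScala.find(_.name == queue) match {
      case Some(existing) =>
        existing.addListener(listener)     // queue exists: just add the listener
      case None =>
        val created = new ToyQueue(queue)  // otherwise create, add, start, register
        created.addListener(listener)
        created.start()
        queues.add(created)
    }
}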
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/MapOutputStatistics/","title":"MapOutputStatistics","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MapOutputStatistics holds statistics about the output partition sizes in a map stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MapOutputStatistics is the result of executing the following (currently internal APIs):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkContext is requested to submitMapStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DAGScheduler is requested to submitMapStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/MapOutputStatistics/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              MapOutputStatistics takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Shuffle Id (of a ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Output Partition Sizes (Array[Long])

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MapOutputStatistics is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster is requested for the statistics (of a ShuffleDependency)
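Conceptually, MapOutputStatistics is little more than the shuffle id paired with the per-partition output sizes. A minimal sketch (the class and field names below are illustrative stand-ins, since the real class is internal to Spark):

```scala
// Illustrative stand-in for MapOutputStatistics (the real class is internal to Spark).
final case class MapOutputStatisticsSketch(
    shuffleId: Int,                   // shuffle id of the ShuffleDependency
    bytesByPartitionId: Array[Long])  // output size (in bytes) per map output partition

// Example: total shuffle output and the number of non-empty partitions.
val stats      = MapOutputStatisticsSketch(shuffleId = 0, bytesByPartitionId = Array(128L, 0L, 64L))
val totalBytes = stats.bytesByPartitionId.sum          // 192
val nonEmpty   = stats.bytesByPartitionId.count(_ > 0) // 2
```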
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/MapOutputTracker/","title":"MapOutputTracker","text":"

MapOutputTracker is the base abstraction of shuffle map output location registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/MapOutputTracker/#contract","title":"Contract","text":""},{"location":"scheduler/MapOutputTracker/#getmapsizesbyexecutorid","title":"getMapSizesByExecutorId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getMapSizesByExecutorId(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SortShuffleManager is requested for a ShuffleReader
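For illustration, a hedged sketch of how a caller could consume the iterator returned by getMapSizesByExecutorId (the helper name and the printed output are made up; MapOutputTracker is a private[spark] API, hence the org.apache.spark package):

```scala
package org.apache.spark

import org.apache.spark.storage.{BlockId, BlockManagerId}

// Hypothetical helper: summarize where the shuffle blocks of one reduce partition live.
object ShuffleBlockLocations {
  def describe(tracker: MapOutputTracker, shuffleId: Int, reducePartition: Int): Unit = {
    val blocksByLocation: Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] =
      tracker.getMapSizesByExecutorId(shuffleId, reducePartition, reducePartition + 1)
    blocksByLocation.foreach { case (blockManagerId, blocks) =>
      val totalBytes = blocks.map { case (_, size, _) => size }.sum
      println(s"$blockManagerId hosts ${blocks.size} shuffle blocks ($totalBytes bytes)")
    }
  }
}
```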
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/MapOutputTracker/#getmapsizesbyrange","title":"getMapSizesByRange
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getMapSizesByRange(\n  shuffleId: Int,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SortShuffleManager is requested for a ShuffleReader
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/MapOutputTracker/#unregistershuffle","title":"unregisterShuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Deletes map output status information for the specified shuffle stage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ContextCleaner is requested to doCleanupShuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManagerSlaveEndpoint is requested to handle a RemoveShuffle message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/MapOutputTracker/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerWorker
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/MapOutputTracker/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                MapOutputTracker takes the following to be created:

• SparkConf

Abstract Class

MapOutputTracker is an abstract class and cannot be created directly. It is created indirectly as one of the concrete MapOutputTrackers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/MapOutputTracker/#accessing-mapoutputtracker","title":"Accessing MapOutputTracker","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MapOutputTracker is available using SparkEnv (on the driver and executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkEnv.get.mapOutputTracker\n
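As a hedged illustration (these are private[spark] classes, so the snippet only compiles from code that itself lives in the org.apache.spark package, e.g. Spark's own tests):

```scala
package org.apache.spark

object MapOutputTrackerAccess {
  // Requires an active SparkContext (so that SparkEnv.get is non-null).
  def currentTracker(): MapOutputTracker = SparkEnv.get.mapOutputTracker

  // On the driver, the registered tracker is the MapOutputTrackerMaster.
  def driverTracker(): MapOutputTrackerMaster =
    SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
}
```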
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/MapOutputTracker/#mapoutputtracker-rpc-endpoint","title":"MapOutputTracker RPC Endpoint

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  trackerEndpoint is a RpcEndpointRef of the MapOutputTracker RPC endpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  trackerEndpoint is initialized (registered or looked up) when SparkEnv is created for the driver and executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  trackerEndpoint is used to communicate (synchronously).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  trackerEndpoint is cleared (null) when MapOutputTrackerMaster is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#deregistering-map-output-status-information-of-shuffle-stage","title":"Deregistering Map Output Status Information of Shuffle Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Deregisters map output status information for the given shuffle stage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ContextCleaner is requested for shuffle cleanup

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManagerSlaveEndpoint is requested to remove a shuffle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#stopping-mapoutputtracker","title":"Stopping MapOutputTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop does nothing at all.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop is used when SparkEnv is requested to stop (and stops all the services, incl. MapOutputTracker).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#converting-mapstatuses-to-blockmanagerids-with-shuffleblockids-and-their-sizes","title":"Converting MapStatuses To BlockManagerIds with ShuffleBlockIds and Their Sizes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  convertMapStatuses(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int,\n  statuses: Array[MapStatus]): Seq[(BlockManagerId, Seq[(BlockId, Long)])]\n

convertMapStatuses iterates over the input statuses array (of MapStatus entries indexed by map id) and, for every status and every partition between the input startPartition and endPartition, creates a pair of the BlockManagerId (of that MapStatus entry) and a ShuffleBlockId (with the input shuffleId, the map id and the partition) together with the estimated size of the reduce block. A simplified sketch follows the list of callers below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  For any empty MapStatus, convertMapStatuses prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Missing an output location for shuffle [id]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  And convertMapStatuses throws a MetadataFetchFailedException (with shuffleId, startPartition, and the above error message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  convertMapStatuses is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MapOutputTrackerMaster is requested for the sizes of shuffle map outputs by executor and range
• MapOutputTrackerWorker is requested for the sizes of shuffle map outputs by executor and range
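A simplified, self-contained sketch of the grouping logic described above; BlockManagerRef, ShuffleBlockRef and MapStatusLike are stand-ins for Spark's internal BlockManagerId, ShuffleBlockId and MapStatus types, and the error handling is reduced to a plain failure:

```scala
final case class BlockManagerRef(executorId: String, host: String, port: Int)
final case class ShuffleBlockRef(shuffleId: Int, mapId: Int, reduceId: Int)
final case class MapStatusLike(location: BlockManagerRef, estimatedSizes: Array[Long])

def convertMapStatusesSketch(
    shuffleId: Int,
    startPartition: Int,
    endPartition: Int,
    statuses: Array[MapStatusLike]): Seq[(BlockManagerRef, Seq[(ShuffleBlockRef, Long)])] = {
  import scala.collection.mutable
  val byLocation =
    mutable.LinkedHashMap.empty[BlockManagerRef, mutable.ArrayBuffer[(ShuffleBlockRef, Long)]]
  statuses.zipWithIndex.foreach {
    case (null, mapId) =>
      // Spark logs "Missing an output location for shuffle [id]" and throws a
      // MetadataFetchFailedException here; the sketch simply fails fast.
      sys.error(s"Missing an output location for shuffle $shuffleId (map $mapId)")
    case (status, mapId) =>
      (startPartition until endPartition).foreach { partition =>
        val block = ShuffleBlockRef(shuffleId, mapId, partition)
        val size  = status.estimatedSizes(partition) // estimated size of the reduce block
        byLocation.getOrElseUpdate(status.location, mutable.ArrayBuffer.empty) += (block -> size)
      }
  }
  byLocation.toSeq.map { case (location, blocks) => location -> blocks.toSeq }
}
```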
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#sending-blocking-messages-to-trackerendpoint-rpcendpointref","title":"Sending Blocking Messages To trackerEndpoint RpcEndpointRef
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  askTracker[T](message: Any): T\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  askTracker sends the input message to trackerEndpoint RpcEndpointRef and waits for a result.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When an exception happens, askTracker prints out the following ERROR message to the logs and throws a SparkException.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Error communicating with MapOutputTracker\n

askTracker is used when MapOutputTracker is requested to fetch map outputs for a ShuffleDependency remotely and to send a one-way message.
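A hedged sketch of this request-reply pattern (not Spark's exact code; RpcEndpointRef and askSync are internal APIs, hence the org.apache.spark package):

```scala
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark.rpc.RpcEndpointRef

class TrackerClientSketch(trackerEndpoint: RpcEndpointRef) {
  // Blocking request-reply with the MapOutputTracker RPC endpoint.
  def askTracker[T: ClassTag](message: Any): T = {
    try {
      trackerEndpoint.askSync[T](message)
    } catch {
      case e: Exception =>
        // Spark logs "Error communicating with MapOutputTracker" at ERROR level here.
        throw new SparkException("Error communicating with MapOutputTracker", e)
    }
  }
}
```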

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#epoch","title":"Epoch

The epoch starts from 0 when MapOutputTracker is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Can be updated (on MapOutputTrackerWorkers) or incremented (on the driver's MapOutputTrackerMaster).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#sendtracker","title":"sendTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sendTracker(\n  message: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sendTracker...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sendTracker is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MapOutputTrackerMaster is requested to stop
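The body of sendTracker is not described above; a hedged sketch of what it plausibly does, reusing askTracker from the TrackerClientSketch shown in the askTracker section (the Boolean-acknowledgement protocol is an assumption):

```scala
package org.apache.spark

import org.apache.spark.rpc.RpcEndpointRef

// Assumption: sendTracker asks the tracker endpoint and expects a `true` acknowledgement.
class OneWayTrackerClientSketch(trackerEndpoint: RpcEndpointRef)
    extends TrackerClientSketch(trackerEndpoint) {

  def sendTracker(message: Any): Unit = {
    val response = askTracker[Boolean](message)
    if (!response) {
      throw new SparkException(
        s"Error reply received from MapOutputTracker. Expecting true, got $response")
    }
  }
}
```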
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/MapOutputTracker/#utilities","title":"Utilities","text":""},{"location":"scheduler/MapOutputTracker/#serializemapstatuses","title":"serializeMapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses(\n  statuses: Array[MapStatus],\n  broadcastManager: BroadcastManager,\n  isLocal: Boolean,\n  minBroadcastSize: Int,\n  conf: SparkConf): (Array[Byte], Broadcast[Array[Byte]])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses serializes the given array of map output locations into an efficient byte format (to send to reduce tasks). serializeMapStatuses compresses the serialized bytes using GZIP. They are supposed to be pretty compressible because many map outputs will be on the same hostname.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, serializeMapStatuses creates a Java ByteArrayOutputStream.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses writes out 0 (direct) first.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses creates a Java GZIPOutputStream (with the ByteArrayOutputStream created) and writes out the given statuses array.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses decides whether to return the output array (of the output stream) or use a broadcast variable based on the size of the byte array.

If the size of the result byte array is at least the given minBroadcastSize threshold, serializeMapStatuses requests the input BroadcastManager to create a broadcast variable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses resets the ByteArrayOutputStream and starts over.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses writes out 1 (broadcast) first.

serializeMapStatuses creates a new Java GZIPOutputStream (over the reset ByteArrayOutputStream) and writes out the broadcast variable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  serializeMapStatuses prints out the following INFO message to the logs:

Broadcast mapstatuses size = [length], actual size = [length]
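
The flow above (serialize directly, switch to a broadcast variable once the result reaches minBroadcastSize, and log both sizes) can be illustrated with a minimal, self-contained Scala sketch. The object and helper names, the use of ObjectOutputStream, and the 0 marker for the direct (non-broadcast) path are assumptions for illustration only; the broadcast step is abstracted behind a plain function:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream, Serializable => JSerializable}
import java.util.zip.GZIPOutputStream

object SerializeMapStatusesSketch {
  // Marker bytes: 0 = statuses serialized inline (assumed), 1 = statuses behind a broadcast variable.
  private val Direct = 0
  private val Broadcast = 1

  // Writes `payload` through a GZIP-compressed object stream appended to `out`.
  private def writeGzipped(out: ByteArrayOutputStream, payload: AnyRef): Unit = {
    val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
    try objOut.writeObject(payload) finally objOut.close()
  }

  // `broadcast` stands in for BroadcastManager.newBroadcast: it turns the serialized
  // bytes into some serializable handle that executors can resolve later.
  def serialize(
      statuses: Array[JSerializable],
      minBroadcastSize: Int)(broadcast: Array[Byte] => JSerializable): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    out.write(Direct)
    writeGzipped(out, statuses)
    val direct = out.toByteArray
    if (direct.length >= minBroadcastSize) {
      val handle = broadcast(direct)
      out.reset()            // start over with the same buffer
      out.write(Broadcast)   // write out 1 (broadcast) first
      writeGzipped(out, handle)
      val viaBroadcast = out.toByteArray
      println(s"Broadcast mapstatuses size = ${viaBroadcast.length}, actual size = ${direct.length}")
      viaBroadcast
    } else {
      direct
    }
  }
}
```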

serializeMapStatuses is used when ShuffleStatus is requested to serialize shuffle map output statuses.

## deserializeMapStatuses

deserializeMapStatuses(
  bytes: Array[Byte],
  conf: SparkConf): Array[MapStatus]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  deserializeMapStatuses...FIXME
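
While the walkthrough above is still a stub, a decoder matching the serializer sketched earlier could look roughly like the following. This is only a sketch of the format implied by the text (a 1-byte marker, then either the GZIP-compressed statuses or a GZIP-compressed broadcast handle); the marker values and the resolveBroadcast function are assumptions, not Spark's actual code:

```scala
import java.io.{ByteArrayInputStream, ObjectInputStream, Serializable => JSerializable}
import java.util.zip.GZIPInputStream

object DeserializeMapStatusesSketch {
  // `resolveBroadcast` stands in for reading the broadcast variable's value (the serialized bytes).
  def deserialize(
      bytes: Array[Byte])(resolveBroadcast: AnyRef => Array[Byte]): Array[JSerializable] = {
    def readGzipped(payload: Array[Byte], offset: Int): AnyRef = {
      val in = new ObjectInputStream(
        new GZIPInputStream(new ByteArrayInputStream(payload, offset, payload.length - offset)))
      try in.readObject() finally in.close()
    }
    bytes(0).toInt match {
      case 0 => // direct: the statuses follow the marker byte
        readGzipped(bytes, 1).asInstanceOf[Array[JSerializable]]
      case 1 => // broadcast: a handle follows; resolve it to the direct form and decode again
        val handle = readGzipped(bytes, 1)
        deserialize(resolveBroadcast(handle))(resolveBroadcast)
      case other =>
        sys.error(s"Unexpected marker byte: $other")
    }
  }
}
```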

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  deserializeMapStatuses is used when:

• MapOutputTrackerWorker is requested to getStatuses

# MapOutputTrackerMaster

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MapOutputTrackerMaster is a MapOutputTracker for the driver.

MapOutputTrackerMaster is the source of truth of shuffle map output locations.

## Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MapOutputTrackerMaster takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BroadcastManager
• isLocal flag (to indicate whether MapOutputTrackerMaster runs in local mode or on a cluster)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, MapOutputTrackerMaster starts dispatcher threads on the map-output-dispatcher thread pool.

MapOutputTrackerMaster is created when:

• SparkEnv utility is used to create a SparkEnv for the driver

## maxRpcMessageSize

maxRpcMessageSize is...FIXME

## BroadcastManager

MapOutputTrackerMaster is given a BroadcastManager when created.

## Shuffle Map Output Status Registry

MapOutputTrackerMaster uses an internal registry of ShuffleStatuses, keyed by shuffle ID (one entry per shuffle stage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MapOutputTrackerMaster adds a new shuffle when requested to register one (when DAGScheduler is requested to create a ShuffleMapStage for a ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MapOutputTrackerMaster uses the registry when requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • registerMapOutput

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • getStatistics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MessageLoop

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • unregisterMapOutput, unregisterAllMapOutput, unregisterShuffle, removeOutputsOnHost, removeOutputsOnExecutor, containsShuffle, getNumAvailableOutputs, findMissingPartitions, getLocationsWithLargestOutputs, getMapSizesByExecutorId
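
A registry like the one described above can be sketched as a concurrent map keyed by shuffle ID; the ShuffleStatus type is reduced to a placeholder here, and the method names simply mirror the operations listed above:

```scala
import scala.collection.concurrent.TrieMap

// Placeholder for the per-shuffle bookkeeping the registry holds.
final case class ShuffleStatus(numPartitions: Int)

// Minimal sketch of a shuffle-ID-keyed registry with register/contains/unregister/clear operations.
class ShuffleStatusRegistry {
  private val shuffleStatuses = TrieMap.empty[Int, ShuffleStatus]

  def registerShuffle(shuffleId: Int, numPartitions: Int): Unit =
    shuffleStatuses.putIfAbsent(shuffleId, ShuffleStatus(numPartitions))

  def containsShuffle(shuffleId: Int): Boolean = shuffleStatuses.contains(shuffleId)

  def unregisterShuffle(shuffleId: Int): Unit = shuffleStatuses.remove(shuffleId)

  def clear(): Unit = shuffleStatuses.clear() // used when stopping
}
```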

MapOutputTrackerMaster removes (clears) all shuffles when requested to stop.

## Configuration Properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MapOutputTrackerMaster uses the following configuration properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.shuffle.mapOutput.minSizeForBroadcast

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spark.shuffle.mapOutput.dispatcher.numThreads

• spark.shuffle.reduceLocality.enabled

## Map and Reduce Task Thresholds for Preferred Locations

MapOutputTrackerMaster defines 1000 (tasks) as the hardcoded threshold on the number of map and reduce tasks when requested to compute preferred locations (with spark.shuffle.reduceLocality.enabled enabled).

## Map Output Threshold for Preferred Location of Reduce Tasks

MapOutputTrackerMaster defines 0.2 as the fraction of total map output that must be at a location for it to be considered a preferred location for a reduce task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Making this larger will focus on fewer locations where most data can be read locally, but may lead to more delay in scheduling if those locations are busy.
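
As a worked example of the 0.2 fraction (the host names and sizes below are made up for illustration):

```scala
// Fraction of a reduce task's total map output that must sit at a single location
// for that location to count as a preferred location (per the description above).
val reducerPrefLocsFraction = 0.2

// Hypothetical map output (in bytes) destined for one reduce task, grouped by location.
val bytesByLocation = Map("host-a" -> 7L * 1024 * 1024, "host-b" -> 1L * 1024 * 1024)
val totalBytes = bytesByLocation.values.sum // 8 MiB in total

// host-a holds 87.5% of the output (>= 20%), so it qualifies as a preferred location;
// host-b holds only 12.5%, so it does not.
val preferredLocations = bytesByLocation.collect {
  case (host, bytes) if bytes.toDouble / totalBytes >= reducerPrefLocsFraction => host
}
```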

MapOutputTrackerMaster uses the fraction when requested for the preferred locations of shuffle RDDs.

## GetMapOutputMessage Queue

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapOutputTrackerMaster uses a blocking queue (a Java LinkedBlockingQueue) for requests for map output statuses.

GetMapOutputMessage(
  shuffleId: Int,
  context: RpcCallContext)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      GetMapOutputMessage holds the shuffle ID and the RpcCallContext of the caller.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      A new GetMapOutputMessage is added to the queue when MapOutputTrackerMaster is requested to post one.

MapOutputTrackerMaster uses MessageLoop Dispatcher Threads to process GetMapOutputMessages.

## MessageLoop Dispatcher Thread

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MessageLoop is a thread of execution to handle GetMapOutputMessages until a PoisonPill marker message arrives (when MapOutputTrackerMaster is requested to stop).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MessageLoop takes a GetMapOutputMessage and prints out the following DEBUG message to the logs:

Handling request to send map output locations for shuffle [shuffleId] to [hostPort]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MessageLoop then finds the ShuffleStatus by the shuffle ID in the shuffleStatuses internal registry and replies back (to the RPC client) with a serialized map output status (with the BroadcastManager and spark.shuffle.mapOutput.minSizeForBroadcast configuration property).
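
Putting the queue and the dispatcher thread together, the pattern can be sketched as follows. The message is stripped down to the shuffle ID, the reply/serialization step is reduced to a log line, and the PoisonPill sentinel value is an assumption of this sketch:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Stripped-down request: just the shuffle ID (the real message also carries the RpcCallContext).
final case class GetMapOutputMessage(shuffleId: Int)

object MapOutputDispatcherSketch {
  private val mapOutputRequests = new LinkedBlockingQueue[GetMapOutputMessage]()

  // Sentinel message telling a MessageLoop to exit (posted when stopping).
  private val PoisonPill = GetMapOutputMessage(-99)

  // post: enqueue the request and return immediately.
  def post(message: GetMapOutputMessage): Unit = mapOutputRequests.offer(message)

  // Body of one dispatcher thread: block on the queue, handle messages, stop on PoisonPill.
  val messageLoop: Runnable = new Runnable {
    override def run(): Unit = {
      var running = true
      while (running) {
        val message = mapOutputRequests.take()
        if (message eq PoisonPill) {
          running = false
        } else {
          // In the real tracker: look up the ShuffleStatus for the shuffle ID and reply
          // to the RPC caller with the serialized map output statuses.
          println(s"Handling request to send map output locations for shuffle ${message.shuffleId}")
        }
      }
    }
  }

  def stop(): Unit = post(PoisonPill)
}
```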

MessageLoop threads run on the map-output-dispatcher Thread Pool.

## map-output-dispatcher Thread Pool

threadpool: ThreadPoolExecutor

threadpool is a daemon fixed thread pool registered with the map-output-dispatcher thread name prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      threadpool uses spark.shuffle.mapOutput.dispatcher.numThreads configuration property for the number of MessageLoop dispatcher threads to process received GetMapOutputMessage messages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The dispatcher threads are started immediately when MapOutputTrackerMaster is created.
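
A daemon fixed thread pool with a name prefix can be built with plain java.util.concurrent; the sketch below is a generic stand-in for the helper Spark uses, with the usage lines (reusing MapOutputDispatcherSketch.messageLoop from the previous sketch) left as comments:

```scala
import java.util.concurrent.{Executors, ThreadFactory, ThreadPoolExecutor}
import java.util.concurrent.atomic.AtomicInteger

object DaemonPools {
  // Generic "daemon fixed thread pool with a name prefix": numThreads workers,
  // each marked as a daemon and named <prefix>-<n>.
  def newDaemonFixedThreadPool(numThreads: Int, prefix: String): ThreadPoolExecutor = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
        t.setDaemon(true)
        t
      }
    }
    Executors.newFixedThreadPool(numThreads, factory).asInstanceOf[ThreadPoolExecutor]
  }
}

// Usage mirroring the description above: start the dispatcher threads right away,
// shut the pool down when stopping.
// val threadpool = DaemonPools.newDaemonFixedThreadPool(numDispatcherThreads, "map-output-dispatcher")
// (0 until numDispatcherThreads).foreach(_ => threadpool.execute(MapOutputDispatcherSketch.messageLoop))
// threadpool.shutdown()
```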

The thread pool is shut down when MapOutputTrackerMaster is requested to stop.

## Epoch Number

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapOutputTrackerMaster uses an epoch number to...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getEpoch is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to removeExecutorAndUnregisterOutputs

• TaskSetManager is created (and sets the epoch on its tasks)

## Enqueueing GetMapOutputMessage

post(
  message: GetMapOutputMessage): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      post simply adds the input GetMapOutputMessage to the mapOutputRequests internal queue.

post is used when MapOutputTrackerMasterEndpoint is requested to handle a GetMapOutputStatuses message.

## Stopping MapOutputTrackerMaster

stop(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      stop...FIXME

stop is part of the MapOutputTracker abstraction.

## Unregistering Shuffle Map Output

unregisterMapOutput(
  shuffleId: Int,
  mapId: Int,
  bmAddress: BlockManagerId): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterMapOutput...FIXME

unregisterMapOutput is used when DAGScheduler is requested to handle a task completion (due to a fetch failure).

## Computing Preferred Locations

getPreferredLocationsForShuffle(
  dep: ShuffleDependency[_, _, _],
  partitionId: Int): Seq[String]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocationsForShuffle computes the locations (BlockManagers) with the most shuffle map outputs for the input ShuffleDependency and Partition.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocationsForShuffle computes the locations when all of the following are met:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • spark.shuffle.reduceLocality.enabled configuration property is enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The number of \"map\" partitions (of the RDD of the input ShuffleDependency) is below SHUFFLE_PREF_MAP_THRESHOLD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The number of \"reduce\" partitions (of the Partitioner of the input ShuffleDependency) is below SHUFFLE_PREF_REDUCE_THRESHOLD

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocationsForShuffle is simply getLocationsWithLargestOutputs with a guard condition.

Internally, getPreferredLocationsForShuffle checks whether the spark.shuffle.reduceLocality.enabled configuration property is enabled and whether the numbers of partitions of the RDD and of the Partitioner of the input ShuffleDependency are both below 1000.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The thresholds for the number of partitions in the RDD and of the partitioner when computing the preferred locations are 1000 and are not configurable.

If the condition holds, getPreferredLocationsForShuffle finds the locations with the largest number of shuffle map outputs for the input ShuffleDependency and partitionId (passing along the number of partitions of the Partitioner of the input ShuffleDependency and the 0.2 fraction threshold) and returns the hosts of the preferred BlockManagers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.2 is the fraction of total map output that must be at a location to be considered as a preferred location for a reduce task. It is not configurable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPreferredLocationsForShuffle is used when ShuffledRDD and Spark SQL's ShuffledRowRDD are requested for preferred locations of a partition.
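As a rough illustration of the guard condition and the hard-coded thresholds described above, the following sketch (with assumed parameter names; locationsWithLargestOutputs stands in for getLocationsWithLargestOutputs) only computes locations for small shuffles and returns the hosts holding at least the 0.2 fraction of a reduce partition's input:

```scala
// Sketch of the guard condition; not Spark's actual code.
val SHUFFLE_PREF_MAP_THRESHOLD = 1000
val SHUFFLE_PREF_REDUCE_THRESHOLD = 1000
val REDUCER_PREF_LOCS_FRACTION = 0.2

def preferredLocationsForShuffle(
    reduceLocalityEnabled: Boolean,   // spark.shuffle.reduceLocality.enabled
    numMapPartitions: Int,            // partitions of the ShuffleDependency's RDD
    numReducePartitions: Int,         // partitions of the ShuffleDependency's Partitioner
    partitionId: Int,
    locationsWithLargestOutputs: (Int, Double) => Option[Seq[String]]): Seq[String] =
  if (reduceLocalityEnabled &&
      numMapPartitions < SHUFFLE_PREF_MAP_THRESHOLD &&
      numReducePartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    // Hosts holding at least 20% of this reduce partition's input, if any.
    locationsWithLargestOutputs(partitionId, REDUCER_PREF_LOCS_FRACTION).getOrElse(Seq.empty)
  } else {
    Seq.empty
  }
```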

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-locations-with-largest-number-of-shuffle-map-outputs","title":"Finding Locations with Largest Number of Shuffle Map Outputs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocationsWithLargestOutputs(\n  shuffleId: Int,\n  reducerId: Int,\n  numReducers: Int,\n  fractionThreshold: Double): Option[Array[BlockManagerId]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocationsWithLargestOutputs returns BlockManagerIds with the largest size (of all the shuffle blocks they manage) above the input fractionThreshold (given the total size of all the shuffle blocks for the shuffle across all BlockManagers).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

getLocationsWithLargestOutputs may return no BlockManagerIds when none of them holds a fraction of the total shuffle output that is at or above the input fractionThreshold.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The input numReducers is not used.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Internally, getLocationsWithLargestOutputs queries the mapStatuses internal cache for the input shuffleId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      One entry in mapStatuses internal cache is a MapStatus array indexed by partition id.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapStatus includes information about the BlockManager (as BlockManagerId) and estimated size of the reduce blocks.

getLocationsWithLargestOutputs iterates over the MapStatus array and builds an interim mapping from a BlockManagerId to the cumulative size of the shuffle blocks that BlockManager hosts.
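That aggregation can be sketched as follows (a simplification with a hypothetical MapStatusLike type, not Spark's MapStatus): sum the estimated block sizes per BlockManagerId for the given reducer and keep only the locations whose share of the total is at or above fractionThreshold.

```scala
// Simplified sketch of the "largest outputs" selection.
final case class BlockManagerId(executorId: String, host: String, port: Int)
final case class MapStatusLike(location: BlockManagerId, sizeForReducer: Int => Long)

def locationsWithLargestOutputs(
    statuses: Array[MapStatusLike],
    reducerId: Int,
    fractionThreshold: Double): Option[Array[BlockManagerId]] = {
  // Interim mapping: BlockManagerId -> cumulative size of the reducer's blocks it hosts.
  val sizeByLocation = statuses
    .groupBy(_.location)
    .map { case (loc, ss) => loc -> ss.map(_.sizeForReducer(reducerId)).sum }
  val totalSize = sizeByLocation.values.sum.toDouble
  val preferred = sizeByLocation.collect {
    case (loc, size) if totalSize > 0 && size / totalSize >= fractionThreshold => loc
  }.toArray
  if (preferred.nonEmpty) Some(preferred) else None
}
```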

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#incrementing-epoch","title":"Incrementing Epoch
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      incrementEpoch(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      incrementEpoch increments the internal epoch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      incrementEpoch prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Increasing epoch to [epoch]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      incrementEpoch is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MapOutputTrackerMaster is requested to unregisterMapOutput, unregisterAllMapOutput, removeOutputsOnHost and removeOutputsOnExecutor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to handle a ShuffleMapTask completion (of a ShuffleMapStage)
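Conceptually, the epoch is just a counter guarded by a lock; a minimal sketch (with assumed field names, not the actual implementation) could look like this:

```scala
// Minimal sketch of the epoch bookkeeping: a counter guarded by a lock.
class EpochTracker {
  private val epochLock = new AnyRef
  private var epoch: Long = 0L

  def incrementEpoch(): Unit = epochLock.synchronized {
    epoch += 1
    println(s"Increasing epoch to $epoch")  // Spark logs this at DEBUG level
  }

  def currentEpoch: Long = epochLock.synchronized(epoch)
}
```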

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#checking-availability-of-shuffle-map-output-status","title":"Checking Availability of Shuffle Map Output Status
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      containsShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      containsShuffle checks if the input shuffleId is registered in the cachedSerializedStatuses or mapStatuses internal caches.

containsShuffle is used when DAGScheduler is requested to create a ShuffleMapStage (for a ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-shuffle","title":"Registering Shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerShuffle(\n  shuffleId: Int,\n  numMaps: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerShuffle registers a new ShuffleStatus (for the given shuffle ID and the number of partitions) to the shuffleStatuses internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerShuffle throws an IllegalArgumentException when the shuffle ID has already been registered:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Shuffle ID [shuffleId] registered twice\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerShuffle is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to create a ShuffleMapStage (for a ShuffleDependency)
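The registration with its duplicate-ID guard can be sketched as follows (ShuffleStatusLike is a hypothetical stand-in for ShuffleStatus):

```scala
// Sketch of shuffle registration with a duplicate-ID guard.
import scala.collection.concurrent.TrieMap

final class ShuffleStatusLike(val numPartitions: Int)

val shuffleStatuses = TrieMap.empty[Int, ShuffleStatusLike]

def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
  val previous = shuffleStatuses.putIfAbsent(shuffleId, new ShuffleStatusLike(numMaps))
  if (previous.isDefined) {
    throw new IllegalArgumentException(s"Shuffle ID $shuffleId registered twice")
  }
}
```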
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-map-outputs-for-shuffle-possibly-with-epoch-change","title":"Registering Map Outputs for Shuffle (Possibly with Epoch Change)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutputs(\n  shuffleId: Int,\n  statuses: Array[MapStatus],\n  changeEpoch: Boolean = false): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutputs registers the input statuses (as the shuffle map output) with the input shuffleId in the mapStatuses internal cache.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutputs increments epoch if the input changeEpoch is enabled (it is not by default).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutputs is used when DAGScheduler handles successful ShuffleMapTask completion and executor lost events.
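A self-contained sketch of the bulk registration (with a hypothetical MapStatusLike type and simplified state, not Spark's internals) stores the whole statuses array for the shuffle at once and bumps the epoch only when changeEpoch is enabled:

```scala
// Sketch: statuses are stored per shuffle as an array indexed by map id.
import scala.collection.mutable

final case class MapStatusLike(host: String, sizeBytes: Long)

val mapStatuses = mutable.Map.empty[Int, Array[MapStatusLike]]
var epoch = 0L

def registerMapOutputs(
    shuffleId: Int,
    statuses: Array[MapStatusLike],
    changeEpoch: Boolean = false): Unit = {
  mapStatuses(shuffleId) = statuses  // replace all map outputs for the shuffle
  if (changeEpoch) epoch += 1        // "possibly with epoch change"
}
```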

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-serialized-map-output-statuses-and-possibly-broadcasting-them","title":"Finding Serialized Map Output Statuses (And Possibly Broadcasting Them)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses(\n  shuffleId: Int): Array[Byte]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses finds cached serialized map statuses for the input shuffleId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If found, getSerializedMapOutputStatuses returns the cached serialized map statuses.

Otherwise, getSerializedMapOutputStatuses acquires the shuffle lock for shuffleId and checks the cached serialized map statuses again, since some other thread could have updated the cachedSerializedStatuses internal cache in the meantime.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses returns the serialized map statuses if found.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If not, getSerializedMapOutputStatuses serializes the local array of MapStatuses (from checkCachedStatuses).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Size of output statuses for shuffle [shuffleId] is [bytes] bytes\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses saves the serialized map output statuses in cachedSerializedStatuses internal cache if the epoch has not changed in the meantime. getSerializedMapOutputStatuses also saves its broadcast version in cachedSerializedBroadcast internal cache.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If the epoch has changed in the meantime, the serialized map output statuses and their broadcast version are not saved, and getSerializedMapOutputStatuses prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Epoch changed, not caching!\n

getSerializedMapOutputStatuses then removes the just-created broadcast (since it is not cached).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses returns the serialized map statuses.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getSerializedMapOutputStatuses is used when MapOutputTrackerMaster responds to GetMapOutputMessage requests and DAGScheduler creates ShuffleMapStage for ShuffleDependency (copying the shuffle map output locations from previous jobs to avoid unnecessarily regenerating data).
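The check-lock-check-again caching pattern described above can be sketched as follows (a simplification in which serialization is a plain function, a single lock stands in for the per-shuffle lock, and the broadcast step is left out):

```scala
// Sketch of the double-checked caching: fast path on the cache, re-check under a
// lock, serialize, and cache only if the epoch has not changed in the meantime.
import scala.collection.concurrent.TrieMap

final class SerializedStatusCache(serialize: Int => Array[Byte]) {
  private val cachedSerializedStatuses = TrieMap.empty[Int, Array[Byte]]
  private val shuffleLock = new AnyRef   // simplification: one lock instead of one per shuffle
  private val epochLock = new AnyRef
  private var epoch: Long = 0L

  def incrementEpoch(): Unit = epochLock.synchronized { epoch += 1 }

  def serializedMapOutputStatuses(shuffleId: Int): Array[Byte] =
    cachedSerializedStatuses.getOrElse(shuffleId, shuffleLock.synchronized {
      // Re-check: another thread may have cached the result while we waited for the lock.
      cachedSerializedStatuses.getOrElse(shuffleId, {
        val epochGotten = epochLock.synchronized(epoch)
        val bytes = serialize(shuffleId)  // the expensive step, done outside epochLock
        epochLock.synchronized {
          if (epoch == epochGotten) cachedSerializedStatuses.put(shuffleId, bytes)
          else println("Epoch changed, not caching!")  // logged at INFO in Spark
        }
        bytes
      })
    })
}
```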

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-cached-serialized-map-statuses","title":"Finding Cached Serialized Map Statuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      checkCachedStatuses(): Boolean\n

checkCachedStatuses is an internal helper method that getSerializedMapOutputStatuses uses to do some bookkeeping (when the epoch and cacheEpoch differ) and to set the local statuses, retBytes and epochGotten variables (that getSerializedMapOutputStatuses uses).

Internally, checkCachedStatuses acquires the epochLock lock and compares the current epoch to cacheEpoch.

If epoch is younger (i.e. greater), checkCachedStatuses clears the cachedSerializedStatuses internal cache, clears the cached serialized broadcasts, and sets cacheEpoch to epoch.

checkCachedStatuses then looks up the cached serialized map output statuses for the requested shuffleId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When the serialized map output status is found, checkCachedStatuses saves it in a local retBytes and returns true.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When not found, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cached status not found for : [shuffleId]\n

checkCachedStatuses uses the mapStatuses internal cache to get the map output statuses for the shuffleId (or falls back to an empty array) and assigns them to the local statuses variable. checkCachedStatuses sets the local epochGotten to the current epoch and returns false.","text":""},{"location":"scheduler/MapOutputTrackerMaster/#registering-shuffle-map-output","title":"Registering Shuffle Map Output

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutput(\n  shuffleId: Int,\n  mapId: Int,\n  status: MapStatus): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutput finds the ShuffleStatus by the given shuffle ID and adds the given MapStatus:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The given mapId is the partitionId of the ShuffleMapTask that finished.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • The given shuffleId is the shuffleId of the ShuffleDependency of the ShuffleMapStage (for which the ShuffleMapTask completed)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerMapOutput is used when DAGScheduler is requested to handle a ShuffleMapTask completion.
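The bookkeeping can be pictured with a minimal, self-contained Scala sketch. The names below (MapStatusLike, ShuffleStatusLike, addMapOutput) are made up for illustration and are not Spark's actual ShuffleStatus or MapStatus classes:

```scala
// A minimal, illustrative model of the registration bookkeeping:
// a shuffle's map outputs are kept in an array indexed by the map partition id.
final case class MapStatusLike(location: String, blockSizes: Array[Long])

final class ShuffleStatusLike(numMaps: Int) {
  private val mapStatuses = new Array[MapStatusLike](numMaps)

  // registerMapOutput-style step: record the finished map task's output under its partition id
  def addMapOutput(mapId: Int, status: MapStatusLike): Unit =
    mapStatuses(mapId) = status

  def numAvailableOutputs: Int = mapStatuses.count(_ != null)
}

object RegisterMapOutputSketch extends App {
  val shuffleStatuses = scala.collection.mutable.Map(0 -> new ShuffleStatusLike(numMaps = 2))
  // A ShuffleMapTask for map partition 1 of shuffle 0 has finished
  shuffleStatuses(0).addMapOutput(1, MapStatusLike("executor-1", Array(10L, 20L)))
  println(shuffleStatuses(0).numAvailableOutputs) // 1
}
```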

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#map-output-statistics-for-shuffledependency","title":"Map Output Statistics for ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getStatistics(\n  dep: ShuffleDependency[_, _, _]): MapOutputStatistics\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getStatistics requests the input ShuffleDependency for the shuffle ID and looks up the corresponding ShuffleStatus (in the shuffleStatuses registry).

getStatistics assumes that the ShuffleStatus is in the shuffleStatuses registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getStatistics requests the ShuffleStatus for the MapStatuses (of the ShuffleDependency).

getStatistics uses the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property to decide whether to compute the statistics in parallel.

Without parallelism, getStatistics simply traverses the MapStatuses and requests each one for the size of every shuffle block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getStatistics requests the given ShuffleDependency for the Partitioner that in turn is requested for the number of partitions.

The number of block sizes to aggregate is the number of MapStatuses multiplied by the number of partitions.

Hence the need for parallelism once that number grows large, as controlled by the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property.

In the end, getStatistics creates a MapOutputStatistics with the shuffle ID (of the given ShuffleDependency) and the total sizes (summed up per partition).
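For instance, 1,000 map outputs over 2,000 reduce partitions mean 2,000,000 individual block sizes to add up, which is what makes the parallel path worthwhile. The sequential path boils down to a column-wise sum, as in this self-contained sketch (the data and names are illustrative, not Spark's MapStatus API):

```scala
// Sequential aggregation sketch: mapOutputSizes(i)(j) = bytes produced by
// map output i for reduce partition j.
object GetStatisticsSketch extends App {
  val numPartitions = 3
  val mapOutputSizes: Seq[Array[Long]] = Seq(
    Array(10L, 0L, 5L),
    Array(7L, 3L, 0L)
  )

  // One running total per reduce partition, summed across all map outputs
  val totalSizes = new Array[Long](numPartitions)
  for (sizes <- mapOutputSizes; partition <- 0 until numPartitions)
    totalSizes(partition) += sizes(partition)

  println(totalSizes.mkString(", ")) // 17, 3, 5
}
```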

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getStatistics is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to handle a successful ShuffleMapStage submission and markMapStageJobsAsFinished
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-all-map-outputs-of-shuffle-stage","title":"Deregistering All Map Outputs of Shuffle Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterAllMapOutput(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterAllMapOutput...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterAllMapOutput is used when DAGScheduler is requested to handle a task completion (due to a fetch failure).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle","title":"Deregistering Shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterShuffle(\n  shuffleId: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterShuffle...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      unregisterShuffle is part of the MapOutputTracker abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle-outputs-associated-with-host","title":"Deregistering Shuffle Outputs Associated with Host
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnHost(\n  host: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnHost...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnHost is used when DAGScheduler is requested to removeExecutorAndUnregisterOutputs and handle a worker removal.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#deregistering-shuffle-outputs-associated-with-executor","title":"Deregistering Shuffle Outputs Associated with Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnExecutor(\n  execId: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      removeOutputsOnExecutor is used when DAGScheduler is requested to removeExecutorAndUnregisterOutputs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#number-of-partitions-with-shuffle-map-outputs-available","title":"Number of Partitions with Shuffle Map Outputs Available
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getNumAvailableOutputs(\n  shuffleId: Int): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getNumAvailableOutputs...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getNumAvailableOutputs is used when ShuffleMapStage is requested for the number of partitions with shuffle outputs available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-missing-partitions","title":"Finding Missing Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      findMissingPartitions(\n  shuffleId: Int): Option[Seq[Int]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      findMissingPartitions...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      findMissingPartitions is used when ShuffleMapStage is requested for missing partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#finding-locations-with-blocks-and-sizes","title":"Finding Locations with Blocks and Sizes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMapSizesByExecutorId(\n  shuffleId: Int,\n  startPartition: Int,\n  endPartition: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long)])]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMapSizesByExecutorId is part of the MapOutputTracker abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMapSizesByExecutorId returns a collection of BlockManagerIds with their blocks and sizes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      When executed, getMapSizesByExecutorId prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Fetching outputs for shuffle [id], partitions [startPartition]-[endPartition]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMapSizesByExecutorId finds map outputs for the input shuffleId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMapSizesByExecutorId gets the map outputs for all the partitions (despite the method's signature).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, getMapSizesByExecutorId converts shuffle map outputs (as MapStatuses) into the collection of BlockManagerIds with their blocks and sizes.
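The shape of that result can be illustrated with a simplified, self-contained sketch. Plain strings stand in for BlockManagerId and BlockId, and the shuffle_[shuffleId]_[mapId]_[reduceId] naming only mirrors shuffle block ids; the types below are made up for the example:

```scala
// Group per-map block sizes by the executor (block manager) that hosts them.
object MapSizesByExecutorSketch extends App {
  final case class MapOutputLike(executor: String, mapId: Int, sizesByReduce: Array[Long])

  val shuffleId = 0
  val statuses = Seq(
    MapOutputLike("exec-1", mapId = 0, sizesByReduce = Array(10L, 5L)),
    MapOutputLike("exec-2", mapId = 1, sizesByReduce = Array(0L, 7L))
  )

  // executor -> (block id, size) pairs, one pair per (map output, reduce partition)
  val byExecutor: Map[String, Seq[(String, Long)]] =
    statuses
      .flatMap { m =>
        m.sizesByReduce.zipWithIndex.toSeq.map { case (size, reduceId) =>
          (m.executor, (s"shuffle_${shuffleId}_${m.mapId}_$reduceId", size))
        }
      }
      .groupBy(_._1)
      .map { case (executor, pairs) => executor -> pairs.map(_._2) }

  byExecutor.foreach { case (executor, blocks) => println(s"$executor -> $blocks") }
}
```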

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMaster/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.MapOutputTrackerMaster logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.MapOutputTrackerMaster=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/","title":"MapOutputTrackerMasterEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapOutputTrackerMasterEndpoint is an RpcEndpoint for MapOutputTrackerMaster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapOutputTrackerMasterEndpoint is registered under the name of MapOutputTracker (on the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      MapOutputTrackerMasterEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MapOutputTrackerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf

MapOutputTrackerMasterEndpoint is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkEnv is created (for the driver and executors)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        While being created, MapOutputTrackerMasterEndpoint prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        init\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#messages","title":"Messages","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#getmapoutputstatuses","title":"GetMapOutputStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GetMapOutputStatuses(\n  shuffleId: Int)\n

Posted when MapOutputTrackerWorker is requested for shuffle map outputs for a given shuffle ID.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When received, MapOutputTrackerMasterEndpoint prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Asked to send map output locations for shuffle [shuffleId] to [hostPort]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, MapOutputTrackerMasterEndpoint requests the MapOutputTrackerMaster to post a GetMapOutputMessage (with the input shuffleId). Whatever is returned from MapOutputTrackerMaster becomes the response.
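The request/response flow (together with the StopMapOutputTracker message described below) can be modeled with a toy, self-contained sketch. The stub classes and the handle method are made up for illustration and are not Spark's RpcEndpoint API:

```scala
// Toy model: the endpoint forwards GetMapOutputStatuses to the tracker and replies
// with whatever the tracker returns; StopMapOutputTracker is confirmed with true.
object EndpointSketch extends App {
  final case class GetMapOutputStatuses(shuffleId: Int)
  case object StopMapOutputTracker

  // Stand-in for MapOutputTrackerMaster: yields (pretend) serialized map output statuses
  class TrackerStub {
    def post(shuffleId: Int): Array[Byte] = Array[Byte](1, 2, 3)
  }

  class EndpointStub(tracker: TrackerStub) {
    def handle(message: Any): Any = message match {
      case GetMapOutputStatuses(shuffleId) =>
        println(s"Asked to send map output locations for shuffle $shuffleId")
        tracker.post(shuffleId) // the tracker's result becomes the response
      case StopMapOutputTracker =>
        println("MapOutputTrackerMasterEndpoint stopped!")
        true // confirm the stop request
    }
  }

  val endpoint = new EndpointStub(new TrackerStub)
  println(endpoint.handle(GetMapOutputStatuses(0)).asInstanceOf[Array[Byte]].length) // 3
  println(endpoint.handle(StopMapOutputTracker)) // true
}
```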

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#stopmapoutputtracker","title":"StopMapOutputTracker

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Posted when MapOutputTrackerMaster is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When received, MapOutputTrackerMasterEndpoint prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapOutputTrackerMasterEndpoint stopped!\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapOutputTrackerMasterEndpoint confirms the request (by replying true) and stops.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerMasterEndpoint/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.MapOutputTrackerMasterEndpoint logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.MapOutputTrackerMasterEndpoint=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapOutputTrackerWorker/","title":"MapOutputTrackerWorker","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapOutputTrackerWorker is the MapOutputTracker for executors.

MapOutputTrackerWorker uses Java's thread-safe java.util.concurrent.ConcurrentHashMap for the mapStatuses internal cache; any lookup that misses the cache triggers a fetch from the driver's MapOutputTrackerMaster.

"},{"location":"scheduler/MapOutputTrackerWorker/#getstatuses","title":"Finding Shuffle Map Outputs","text":"

getStatuses(\n  shuffleId: Int): Array[MapStatus]\n

getStatuses finds MapStatuses for the input shuffleId in the mapStatuses internal cache and, when not available, fetches them from a remote MapOutputTrackerMaster (using RPC).

Internally, getStatuses first queries the mapStatuses internal cache and returns the map outputs if found.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        If not found (in the mapStatuses internal cache), you should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Don't have map outputs for shuffle [id], fetching them\n

If another thread is already fetching the map outputs for the shuffleId (as recorded in the fetching internal registry), getStatuses waits until that fetch is done.

When no other thread is fetching the map outputs, getStatuses registers the input shuffleId in the fetching internal registry (of shuffle map outputs being fetched).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Doing the fetch; tracker endpoint = [trackerEndpoint]\n

getStatuses sends a GetMapOutputStatuses RPC remote message for the input shuffleId to the trackerEndpoint expecting an Array[Byte].

NOTE: getStatuses requests shuffle map outputs remotely within a timeout and with retries. Refer to RpcEndpointRef.

getStatuses deserializes the fetched map output statuses and records the result in the mapStatuses internal cache.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You should see the following INFO message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Got the output locations\n

getStatuses removes the input shuffleId from the fetching internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Fetching map output statuses for shuffle [id] took [time] ms\n

If getStatuses could not find the map output locations for the input shuffleId (locally and remotely), you should see the following ERROR message in the logs and getStatuses throws a MetadataFetchFailedException.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Missing all output locations for shuffle [id]\n

NOTE: getStatuses is used when MapOutputTracker is requested for the sizes of shuffle map outputs and to compute statistics for a ShuffleDependency.
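The cache-then-fetch flow described above can be condensed into a sketch like the following. It is illustrative only: the MapStatus stand-in and the askDriverForStatuses/deserialize helpers are assumed stubs, not the actual MapOutputTrackerWorker code.

```scala
// Illustrative sketch of the cache-then-fetch flow (not the actual MapOutputTrackerWorker).
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

abstract class TrackerWorkerSketch {
  type MapStatus                                                    // stand-in for the real MapStatus
  protected def askDriverForStatuses(shuffleId: Int): Array[Byte]   // GetMapOutputStatuses RPC (stub)
  protected def deserialize(bytes: Array[Byte]): Array[MapStatus]   // decode the driver's reply (stub)

  private val mapStatuses = new ConcurrentHashMap[Int, Array[MapStatus]]() // thread-safe cache
  private val fetching = new mutable.HashSet[Int]                          // shuffles being fetched

  def getStatuses(shuffleId: Int): Array[MapStatus] = {
    val cached = mapStatuses.get(shuffleId)
    if (cached != null) {
      cached                                                 // fast path: cache hit
    } else {
      val fetchedByOther = fetching.synchronized {
        while (fetching.contains(shuffleId)) fetching.wait() // another thread is fetching: wait
        val afterWait = mapStatuses.get(shuffleId)
        if (afterWait == null) fetching += shuffleId         // this thread will do the fetch
        afterWait
      }
      if (fetchedByOther != null) {
        fetchedByOther                                       // the other thread populated the cache
      } else {
        try {
          val statuses = deserialize(askDriverForStatuses(shuffleId))
          mapStatuses.put(shuffleId, statuses)               // record the result in the cache
          statuses
        } finally {
          fetching.synchronized {
            fetching -= shuffleId
            fetching.notifyAll()                             // wake up any waiting threads
          }
        }
      }
    }
  }
}
```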

"},{"location":"scheduler/MapOutputTrackerWorker/#logging","title":"Logging","text":"

Enable ALL logging level for org.apache.spark.MapOutputTrackerWorker logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.MapOutputTrackerWorker=ALL\n

Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/MapStatus/","title":"MapStatus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapStatus is an abstraction of shuffle map output statuses with an estimated size, location and map Id.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapStatus is a result of executing a ShuffleMapTask.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        After a ShuffleMapTask has finished execution successfully, DAGScheduler is requested to handle a ShuffleMapTask completion that in turn requests the MapOutputTrackerMaster to register the MapStatus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/MapStatus/#contract","title":"Contract","text":""},{"location":"scheduler/MapStatus/#estimated-size","title":"Estimated Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getSizeForBlock(\n  reduceId: Int): Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Estimated size (in bytes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTrackerMaster is requested for a MapOutputStatistics and locations with the largest number of shuffle map outputs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • OptimizeSkewedJoin (Spark SQL) physical optimization is executed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapStatus/#location","title":"Location
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        location: BlockManagerId\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerId of the shuffle map output (i.e. the BlockManager where a ShuffleMapTask ran and the result is stored)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ShuffleStatus is requested to removeMapOutput and removeOutputsByFilter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTrackerMaster is requested for locations with the largest number of shuffle map outputs and getMapLocation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is requested to handle a ShuffleMapTask completion
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapStatus/#map-id","title":"Map Id
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        mapId: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Map Id of the shuffle map output

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • MapOutputTracker utility is used to convert MapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapStatus/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • CompressedMapStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • HighlyCompressedMapStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Sealed Trait

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapStatus is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).
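As a toy illustration of what the sealed modifier buys (not Spark's actual MapStatus hierarchy), all implementations sit next to the trait in one file, so the compiler can check pattern matches over the trait for exhaustiveness:

```scala
// Toy sealed trait: all implementations must live in this one file.
sealed trait ToyStatus {
  def sizeForBlock(reduceId: Int): Long
}

final case class ExactSizes(sizes: Array[Long]) extends ToyStatus {
  def sizeForBlock(reduceId: Int): Long = sizes(reduceId)
}

final case class AverageSize(avg: Long) extends ToyStatus {
  def sizeForBlock(reduceId: Int): Long = avg
}

object ToyStatus {
  // The compiler warns if a new implementation is added but not handled here.
  def describe(s: ToyStatus): String = s match {
    case ExactSizes(sizes) => s"exact sizes for ${sizes.length} blocks"
    case AverageSize(avg)  => s"average block size of $avg bytes"
  }
}
```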

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/MapStatus/#sparkshuffleminnumpartitionstohighlycompress","title":"spark.shuffle.minNumPartitionsToHighlyCompress

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        MapStatus utility uses spark.shuffle.minNumPartitionsToHighlyCompress internal configuration property for the minimum number of partitions to prefer a HighlyCompressedMapStatus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/MapStatus/#creating-mapstatus","title":"Creating MapStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        apply(\n  loc: BlockManagerId,\n  uncompressedSizes: Array[Long],\n  mapTaskId: Long): MapStatus\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        apply creates a HighlyCompressedMapStatus when the number of uncompressedSizes is above minPartitionsToUseHighlyCompressMapStatus threshold. Otherwise, apply creates a CompressedMapStatus.
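A minimal sketch of that decision follows. It is illustrative only: the stand-in types and the default threshold of 2000 partitions are assumptions, not the actual MapStatus.apply.

```scala
// Illustrative sketch of the size-based choice (not the actual MapStatus.apply).
object MapStatusChoiceSketch {
  sealed trait Kind
  case object Compressed extends Kind        // per-block compressed sizes (CompressedMapStatus)
  case object HighlyCompressed extends Kind  // average size plus huge blocks (HighlyCompressedMapStatus)

  def choose(
      uncompressedSizes: Array[Long],
      minPartitionsToUseHighlyCompressMapStatus: Int = 2000): Kind = {
    if (uncompressedSizes.length > minPartitionsToUseHighlyCompressMapStatus) {
      HighlyCompressed   // many partitions: keep only a lossy summary of the sizes
    } else {
      Compressed         // few partitions: keep a (compressed) size per block
    }
  }
}
```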

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        apply is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BypassMergeSortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • UnsafeShuffleWriter is requested to close resources and write out merged spill files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/Pool/","title":"Pool","text":"

Schedulable Pool

Pool is a Schedulable entity that represents a tree of TaskSetManagers, i.e. it contains a collection of TaskSetManagers or the Pools thereof.

A Pool has a mandatory name, a scheduling mode, an initial minShare and a weight that are defined when it is created.
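For illustration, a pool hierarchy could be put together as sketched below. Pool is part of the private[spark] org.apache.spark.scheduler package, so the snippet compiles only inside Spark's own sources or tests, and the constructor shape (name, scheduling mode, minShare, weight) should be read as an assumption mirroring the description above.

```scala
// Sketch only: Pool is private[spark], so this is not user-facing API.
import org.apache.spark.scheduler.{Pool, SchedulingMode}

// Root pool with FAIR scheduling, no minimum share, default weight.
val rootPool = new Pool("root", SchedulingMode.FAIR, 0, 1)

// Child pool that asks for a minimum share of 2 cores and twice the default weight.
val prodPool = new Pool("production", SchedulingMode.FAIR, 2, 2)
rootPool.addSchedulable(prodPool)

// TaskSetManagers are Schedulables too and are added to a pool the same way, e.g.
// prodPool.addSchedulable(taskSetManager)
```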

NOTE: An instance of Pool is created when TaskSchedulerImpl is initialized.

NOTE: The TaskScheduler Contract and Schedulable Contract both require that their entities have a rootPool of type Pool.

increaseRunningTasks Method

CAUTION: FIXME

decreaseRunningTasks Method

CAUTION: FIXME

taskSetSchedulingAlgorithm Attribute

Using the spark-scheduler-SchedulingMode.md[scheduling mode] (given when a Pool is created), Pool selects a <<SchedulingAlgorithm, SchedulingAlgorithm>> and sets taskSetSchedulingAlgorithm:

• <<FIFOSchedulingAlgorithm, FIFOSchedulingAlgorithm>> for FIFO scheduling mode.
• <<FairSchedulingAlgorithm, FairSchedulingAlgorithm>> for FAIR scheduling mode.

Pool throws an IllegalArgumentException when an unsupported scheduling mode is passed in:

Unsupported spark.scheduler.mode: [schedulingMode]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TIP: Read about the scheduling modes in spark-scheduler-SchedulingMode.md[SchedulingMode].

NOTE: taskSetSchedulingAlgorithm is used in <<getSortedTaskSetQueue, getSortedTaskSetQueue>>.
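
The selection above could look roughly as follows. This is a sketch based on the description in this section; the match expression and the exact exception message are assumptions, not a verbatim copy of the Spark sources.

[source, scala]
----
// Sketch: select the scheduling algorithm for the pool's scheduling mode.
val taskSetSchedulingAlgorithm: SchedulingAlgorithm = schedulingMode match {
  case SchedulingMode.FAIR => new FairSchedulingAlgorithm()
  case SchedulingMode.FIFO => new FIFOSchedulingAlgorithm()
  case _ => throw new IllegalArgumentException(
    s"Unsupported spark.scheduler.mode: $schedulingMode")
}
----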

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[getSortedTaskSetQueue]] Getting TaskSetManagers Sorted -- getSortedTaskSetQueue Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: getSortedTaskSetQueue is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract].

getSortedTaskSetQueue sorts all the spark-scheduler-Schedulable.md[Schedulables] in the spark-scheduler-Schedulable.md#contract[schedulableQueue] queue by a <<SchedulingAlgorithm, SchedulingAlgorithm>> (the internal <<taskSetSchedulingAlgorithm, taskSetSchedulingAlgorithm>>).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: It is called when scheduler:TaskSchedulerImpl.md#resourceOffers[TaskSchedulerImpl processes executor resource offers].
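
A sketch of the sorting logic, assuming schedulableQueue is a java.util.concurrent.ConcurrentLinkedQueue[Schedulable], Scala 2.13's scala.jdk.CollectionConverters is available, and every child Schedulable can produce its own sorted TaskSetManagers:

[source, scala]
----
import scala.collection.mutable.ArrayBuffer
import scala.jdk.CollectionConverters._

// Sketch: sort the direct children with the pool's scheduling algorithm
// and concatenate their (already sorted) TaskSetManagers.
def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  val sorted = new ArrayBuffer[TaskSetManager]
  val sortedSchedulables =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  sortedSchedulables.foreach(s => sorted ++= s.getSortedTaskSetQueue)
  sorted
}
----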

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[schedulableNameToSchedulable]] Schedulables by Name -- schedulableNameToSchedulable Registry

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/Pool/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/Pool/#schedulablenametoschedulable-new-concurrenthashmapstring-schedulable","title":"schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          schedulableNameToSchedulable is a lookup table of spark-scheduler-Schedulable.md[Schedulable] objects by their names.

Besides the obvious usage in housekeeping methods such as addSchedulable, removeSchedulable and getSchedulableByName from the spark-scheduler-Schedulable.md#contract[Schedulable Contract], schedulableNameToSchedulable is only used in SparkContext.md#getPoolForName[SparkContext.getPoolForName].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[addSchedulable]] addSchedulable Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: addSchedulable is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract].

addSchedulable adds a Schedulable to the spark-scheduler-Schedulable.md#contract[schedulableQueue] and <<schedulableNameToSchedulable, schedulableNameToSchedulable>>.

More importantly, it sets the Schedulable's spark-scheduler-Schedulable.md#contract[parent] to the Pool itself.
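
A minimal sketch of what addSchedulable does, under the assumption that the two registries are the schedulableQueue and the schedulableNameToSchedulable map described above (removeSchedulable, described next, mirrors it by removing from both):

[source, scala]
----
// Sketch: register the child in both collections and make this pool its parent.
def addSchedulable(schedulable: Schedulable): Unit = {
  schedulableQueue.add(schedulable)
  schedulableNameToSchedulable.put(schedulable.name, schedulable)
  schedulable.parent = this
}
----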

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[removeSchedulable]] removeSchedulable Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: removeSchedulable is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract].

removeSchedulable removes a Schedulable from the spark-scheduler-Schedulable.md#contract[schedulableQueue] and <<schedulableNameToSchedulable, schedulableNameToSchedulable>>.

NOTE: removeSchedulable is the opposite of the <<addSchedulable, addSchedulable>> method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[SchedulingAlgorithm]] SchedulingAlgorithm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SchedulingAlgorithm is the interface for a sorting algorithm to sort spark-scheduler-Schedulable.md[Schedulables].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          There are currently two SchedulingAlgorithms:

• <<FIFOSchedulingAlgorithm, FIFOSchedulingAlgorithm>> for FIFO scheduling mode.
• <<FairSchedulingAlgorithm, FairSchedulingAlgorithm>> for FAIR scheduling mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== [[FIFOSchedulingAlgorithm]] FIFOSchedulingAlgorithm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FIFOSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their priority first and, when equal, by their stageId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: priority and stageId are part of spark-scheduler-Schedulable.md#contract[Schedulable Contract].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME A picture is worth a thousand words. How to picture the algorithm?
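
A sketch of a FIFO comparator that orders by priority and falls back to stageId; this is a simplified rendering of the behaviour described above, not a verbatim copy of the Spark sources.

[source, scala]
----
// Sketch: the lower priority value wins; ties are broken by the lower stage id.
class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    if (s1.priority != s2.priority) s1.priority < s2.priority
    else s1.stageId < s2.stageId
  }
}
----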

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ==== [[FairSchedulingAlgorithm]] FairSchedulingAlgorithm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FairSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their minShare, runningTasks, and weight.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: minShare, runningTasks, and weight are part of spark-scheduler-Schedulable.md#contract[Schedulable Contract].

.FairSchedulingAlgorithm
image::spark-pool-FairSchedulingAlgorithm.png[align="center"]

For each input Schedulable, minShareRatio is computed as runningTasks divided by minShare (with minShare treated as at least 1), while taskToWeightRatio is runningTasks divided by weight.
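
A simplified sketch of the FAIR comparison based on the ratios above: a Schedulable running fewer tasks than its minShare is considered "needy" and wins over one that is not, otherwise the ratios decide. This illustrates the described behaviour and is not the exact Spark implementation (which, for instance, also breaks ties by name).

[source, scala]
----
// Sketch: FAIR ordering using minShareRatio and taskToWeightRatio.
class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val s1Needy = s1.runningTasks < s1.minShare
    val s2Needy = s2.runningTasks < s2.minShare
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight

    if (s1Needy && !s2Needy) true          // only s1 is below its minimum share
    else if (!s1Needy && s2Needy) false    // only s2 is below its minimum share
    else if (s1Needy && s2Needy) minShareRatio1 < minShareRatio2
    else taskToWeightRatio1 < taskToWeightRatio2
  }
}
----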

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === [[getSchedulableByName]] Finding Schedulable by Name -- getSchedulableByName Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/Pool/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/Pool/#getschedulablebynameschedulablename-string-schedulable","title":"getSchedulableByName(schedulableName: String): Schedulable","text":"

NOTE: getSchedulableByName is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract] to find a spark-scheduler-Schedulable.md[Schedulable] by name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getSchedulableByName...FIXME
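
A sketch of how the lookup could work, assuming it first consults the schedulableNameToSchedulable registry and then recurses into the child Schedulables in schedulableQueue (a java.util.concurrent.ConcurrentLinkedQueue[Schedulable] here):

[source, scala]
----
// Sketch: direct lookup first, then delegate to the children; null if nothing matches.
def getSchedulableByName(schedulableName: String): Schedulable = {
  val direct = schedulableNameToSchedulable.get(schedulableName)
  if (direct != null) {
    direct
  } else {
    var found: Schedulable = null
    val it = schedulableQueue.iterator()
    while (found == null && it.hasNext) {
      found = it.next().getSchedulableByName(schedulableName)
    }
    found
  }
}
----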

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ResultStage/","title":"ResultStage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ResultStage is the final stage in a job that applies a function to one or many partitions of the target RDD to compute the result of an action.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The partitions are given as a collection of partition ids (partitions) and the function func: (TaskContext, Iterator[_]) => _.
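
For example, an action such as RDD.count boils down to a job whose ResultStage applies a per-partition function. The following is a hedged illustration of the shape such a func can take, not the exact function Spark builds internally.

[source, scala]
----
import org.apache.spark.TaskContext

// Sketch: the kind of function a ResultStage applies to every target partition.
// For a count-like action, each partition is reduced to its number of records.
val func: (TaskContext, Iterator[_]) => Long =
  (_: TaskContext, it: Iterator[_]) => it.size.toLong
----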

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[findMissingPartitions]] Finding Missing Partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ResultStage/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/ResultStage/#findmissingpartitions-seqint","title":"findMissingPartitions(): Seq[Int]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: findMissingPartitions is part of the scheduler:Stage.md#findMissingPartitions[Stage] abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            findMissingPartitions...FIXME

.ResultStage.findMissingPartitions and ActiveJob
image::resultstage-findMissingPartitions.png[align="center"]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the above figure, partitions 1 and 2 are not finished (F is false while T is true).
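
A sketch of how the missing partitions could be derived from the ActiveJob's finished flags shown in the figure, assuming activeJob is defined while the stage is being computed:

[source, scala]
----
// Sketch: a partition is missing as long as the active job has not marked it finished.
def findMissingPartitions(): Seq[Int] = {
  val job = activeJob.get
  (0 until job.numPartitions).filter(id => !job.finished(id))
}
----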

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[func]] func Property

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[setActiveJob]] setActiveJob Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[removeActiveJob]] removeActiveJob Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[activeJob]] activeJob Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ResultStage/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/ResultStage/#activejob-optionactivejob","title":"activeJob: Option[ActiveJob]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            activeJob returns the optional ActiveJob associated with a ResultStage.

CAUTION: FIXME When/why would that be None (empty)?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ResultTask/","title":"ResultTask","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ResultTask[T, U] is a Task that executes a partition processing function on a partition with records (of type T) to produce a result (of type U) that is sent back to the driver.

T -- [ResultTask] --> U
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ResultTask/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ResultTask takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stage Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Broadcast variable with a serialized task (Broadcast[Array[Byte]])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Partition to compute
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskLocation
* Output ID
* Local Properties
* Serialized TaskMetrics (Array[Byte])
* ActiveJob ID (optional)
* Application ID (optional)
* Application Attempt ID (optional)
* isBarrier flag (default: false)

ResultTask is created when:

* DAGScheduler is requested to submit missing tasks of a ResultStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/ResultTask/#running-task","title":"Running Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask(\n  context: TaskContext): U\n

runTask is part of the Task abstraction.

runTask deserializes an RDD and a partition processing function from the broadcast variable (using the Closure Serializer).

In the end, runTask executes the function (on the records of the partition of the RDD).
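These two steps can be sketched as follows. This is a simplified view (metrics bookkeeping omitted) that assumes taskBinary (a Broadcast[Array[Byte]] with the serialized RDD and function) and partition (the Partition to compute) are available on the ResultTask:

```scala
import java.nio.ByteBuffer

import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.rdd.RDD

// inside ResultTask[T, U]
override def runTask(context: TaskContext): U = {
  // Deserialize the (RDD, partition processing function) pair from the
  // broadcast task binary using the Closure Serializer
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  // Execute the function on the records of this task's partition
  func(context, rdd.iterator(partition, context))
}
```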

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/Schedulable/","title":"Schedulable","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              == [[Schedulable]] Schedulable Contract -- Schedulable Entities

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Schedulable is the <> of <> that manages the <> and can <>.
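The contract can be sketched as the following Scala trait, assembled from the methods in the table below (in Spark the trait is private[spark], so treat this as a simplified sketch rather than the exact source):

[source, scala]
----
import java.util.concurrent.ConcurrentLinkedQueue

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{ExecutorLossReason, Pool, TaskSetManager}
import org.apache.spark.scheduler.SchedulingMode.SchedulingMode

trait Schedulable {
  // Scheduling attributes
  var parent: Pool
  def schedulableQueue: ConcurrentLinkedQueue[Schedulable]
  def schedulingMode: SchedulingMode
  def weight: Int
  def minShare: Int
  def runningTasks: Int
  def priority: Int
  def stageId: Int
  def name: String

  // Managing the queue of Schedulables
  def addSchedulable(schedulable: Schedulable): Unit
  def removeSchedulable(schedulable: Schedulable): Unit
  def getSchedulableByName(name: String): Schedulable

  // Events and scheduling
  def executorLost(executorId: String, host: String, reason: ExecutorLossReason): Unit
  def checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean
  def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]
}
----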

[[contract]]
.Schedulable Contract
[cols="1m,3",options="header",width="100%"]
|===
| Method
| Description

| addSchedulable
a| [[addSchedulable]]

[source, scala]
----
addSchedulable(schedulable: Schedulable): Unit
----

Registers a Schedulable (with the schedulableQueue)

Used when:

* FIFOSchedulableBuilder is requested to addTaskSetManager
* FairSchedulableBuilder is requested to buildDefaultPool, buildFairSchedulerPool, and addTaskSetManager

| checkSpeculatableTasks
a| [[checkSpeculatableTasks]]

[source, scala]
----
checkSpeculatableTasks(minTimeToSpeculation: Int): Boolean
----

Used when...FIXME

| executorLost
a| [[executorLost]]

[source, scala]
----
executorLost(
  executorId: String,
  host: String,
  reason: ExecutorLossReason): Unit
----

Handles an executor lost event

Used when:

* Pool is requested to handle an executor lost event (recursively)
* TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#removeExecutor[removeExecutor]

| getSchedulableByName
a| [[getSchedulableByName]]

[source, scala]
----
getSchedulableByName(name: String): Schedulable
----

Finds a Schedulable by name

Used when...FIXME

| getSortedTaskSetQueue
a| [[getSortedTaskSetQueue]]

[source, scala]
----
getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]
----

Builds a collection of scheduler:TaskSetManager.md[TaskSetManagers] sorted by the scheduling algorithm (per the schedulingMode)

Used when:

* Pool is requested to getSortedTaskSetQueue (recursively)
* TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#resourceOffers[resourceOffers]

| minShare
a| [[minShare]]

[source, scala]
----
minShare: Int
----

Used when...FIXME

| name
a| [[name]]

[source, scala]
----
name: String
----

Used when...FIXME

| parent
a| [[parent]]

[source, scala]
----
parent: Pool
----

Used when...FIXME

| priority
a| [[priority]]

[source, scala]
----
priority: Int
----

Used when...FIXME

| removeSchedulable
a| [[removeSchedulable]]

[source, scala]
----
removeSchedulable(schedulable: Schedulable): Unit
----

Used when...FIXME

| runningTasks
a| [[runningTasks]]

[source, scala]
----
runningTasks: Int
----

Used when...FIXME

| schedulableQueue
a| [[schedulableQueue]]

[source, scala]
----
schedulableQueue: ConcurrentLinkedQueue[Schedulable]
----

Queue of Schedulables (as a https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html[ConcurrentLinkedQueue])

Used when:

* SparkContext is requested to SparkContext.md#getAllPools[getAllPools]
* Pool is requested to addSchedulable, removeSchedulable, getSchedulableByName, executorLost, checkSpeculatableTasks, and getSortedTaskSetQueue

| schedulingMode
a| [[schedulingMode]]

[source, scala]
----
schedulingMode: SchedulingMode
----

Scheduling mode of the Schedulable

Used when:

* Pool is created (to select the scheduling algorithm)
* web UI's PoolTable is requested to render a page with pools (poolRow)

| stageId
a| [[stageId]]

[source, scala]
----
stageId: Int
----

Used when...FIXME

| weight
a| [[weight]]

[source, scala]
----
weight: Int
----

Used when...FIXME

|===

[[implementations]]
.Schedulables
[cols="1,3",options="header",width="100%"]
|===
| Schedulable
| Description

| Pool
| [[Pool]] Pool of Schedulables (i.e. a recursive data structure for prioritizing task sets)

| scheduler:TaskSetManager.md[TaskSetManager]
| [[TaskSetManager]] Manages scheduling of tasks of a scheduler:TaskSet.md[TaskSet]

|===
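For illustration, the two Schedulables form a hierarchy with a Pool at the root and TaskSetManagers as leaves. The following is a minimal, hypothetical sketch of that hierarchy (in practice TaskSchedulerImpl creates both, and the classes are private[spark]; the taskSetManager value is assumed to exist):

[source, scala]
----
import org.apache.spark.scheduler.{Pool, SchedulingMode, TaskSetManager}

// A FIFO root pool (the top of the Schedulable hierarchy)
val rootPool = new Pool("root", SchedulingMode.FIFO, initMinShare = 0, initWeight = 0)

// A TaskSetManager is a leaf Schedulable (assumed to be created elsewhere)
val taskSetManager: TaskSetManager = ???

// Register the leaf with the root pool
rootPool.addSchedulable(taskSetManager)

// TaskSetManagers in scheduling order (what TaskSchedulerImpl uses for resourceOffers)
val sortedTaskSets = rootPool.getSortedTaskSetQueue
----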

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/SchedulableBuilder/","title":"SchedulableBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[SchedulableBuilder]] SchedulableBuilder Contract -- Builders of Schedulable Pools

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        SchedulableBuilder is the <> of <> that manage a <>, which is to <> and <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        SchedulableBuilder is a private[spark] Scala trait that is used exclusively by scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] (the default Spark scheduler). When requested to scheduler:TaskSchedulerImpl.md#initialize[initialize], TaskSchedulerImpl uses the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property (default: FIFO) to select one of the <>.
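For example, the FAIR builder can be selected by setting the property on the SparkConf before the SparkContext is created (a minimal sketch; the application name is made up):

[source, scala]
----
import org.apache.spark.SparkConf

// Select the FAIR SchedulableBuilder instead of the default FIFO one.
val conf = new SparkConf()
  .setAppName("scheduling-demo")       // hypothetical application name
  .set("spark.scheduler.mode", "FAIR") // FIFO (default) or FAIR
----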

[[contract]]
.SchedulableBuilder Contract
[cols=\"1m,3\",options=\"header\",width=\"100%\"]
|===
| Method | Description

| addTaskSetManager
a| [[addTaskSetManager]]

[source, scala]
----
addTaskSetManager(manager: Schedulable, properties: Properties): Unit
----

Registers a new Schedulable (a TaskSetManager) with the <<rootPool, rootPool>>

Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#submitTasks[submit tasks (of a TaskSet) for execution] (and registers a new scheduler:TaskSetManager.md[TaskSetManager] for the TaskSet)

| buildPools
a| [[buildPools]]

[source, scala]
----
buildPools(): Unit
----

Builds a tree of pools (of Schedulables)

Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize] (and creates a scheduler:TaskSchedulerImpl.md#schedulableBuilder[SchedulableBuilder] per the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property)

| rootPool
a| [[rootPool]]

[source, scala]
----
rootPool: Pool
----

Root (top-level) pool of Schedulables

Used when:

• FIFOSchedulableBuilder is requested to <>

• FairSchedulableBuilder is requested to <>, <>, and <>

|===

[[implementations]]
.SchedulableBuilders
[cols=\"1,3\",options=\"header\",width=\"100%\"]
|===
| SchedulableBuilder | Description

| FairSchedulableBuilder | [[FairSchedulableBuilder]] Used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FAIR

| FIFOSchedulableBuilder | [[FIFOSchedulableBuilder]] Default SchedulableBuilder, used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FIFO (default)

|===
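Putting the contract together, a condensed, non-authoritative sketch of the trait could look as follows. Pool and Schedulable are placeholder types standing in for Spark's private[spark] classes, and SchedulableBuilderSketch is a made-up name:

[source, scala]
----
import java.util.Properties

// Placeholder types standing in for Spark's internal Schedulable and Pool.
trait Schedulable
class Pool extends Schedulable

// Condensed sketch of the SchedulableBuilder contract described above.
trait SchedulableBuilderSketch {
  // Root (top-level) pool of Schedulables
  def rootPool: Pool

  // Builds the tree of pools (invoked when TaskSchedulerImpl is initialized)
  def buildPools(): Unit

  // Registers a new Schedulable (a TaskSetManager) with the rootPool
  // (invoked when TaskSchedulerImpl submits the tasks of a TaskSet)
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}
----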

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/SchedulerBackend/","title":"SchedulerBackend","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SchedulerBackend is an abstraction of task scheduling backends that can revive resource offers from cluster managers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SchedulerBackend abstraction allows TaskSchedulerImpl to use variety of cluster managers (with their own resource offers and task scheduling modes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Being a scheduler backend system assumes a Apache Mesos-like scheduling model in which \"an application\" gets resource offers as machines become available so it is possible to launch tasks on them. Once required resource allocation is obtained, the scheduler backend can start executors.
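Before the individual methods, here is a condensed, non-authoritative sketch of the contract covered below (only the members described on this page; the real trait is private[spark] and has more, and SchedulerBackendSketch is a made-up name). The defaults mirror the ones documented for each method:

```scala
// Condensed sketch of the SchedulerBackend contract described below.
trait SchedulerBackendSketch {
  // Identifiers of this Spark application (and of its execution attempt)
  def applicationId(): String = s"spark-application-${System.currentTimeMillis}"
  def applicationAttemptId(): Option[String] = None

  // A hint for the number of tasks in stages while sizing jobs
  def defaultParallelism(): Int

  // Whether the backend is ready for TaskSchedulerImpl to start scheduling
  def isReady(): Boolean = true

  // Resource offers and task lifecycle
  def reviveOffers(): Unit
  def killTask(
      taskId: Long,
      executorId: String,
      interruptThread: Boolean,
      reason: String): Unit =
    throw new UnsupportedOperationException("killTask is not supported by this backend")
}
```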

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/SchedulerBackend/#contract","title":"Contract","text":""},{"location":"scheduler/SchedulerBackend/#applicationattemptid","title":"applicationAttemptId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          applicationAttemptId(): Option[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Execution attempt ID of this Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: None (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested for the execution attempt ID of a Spark application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#applicationid","title":"applicationId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          applicationId(): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Unique identifier of this Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: spark-application-[currentTimeMillis]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested for the unique identifier of a Spark application
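For reference (outside the SchedulerBackend contract itself), the resulting identifier is also exposed to user code as SparkContext.applicationId; a minimal check, assuming an already-created SparkContext `sc`:

```scala
// The ID comes from the scheduler backend; the trait default has the
// spark-application-<timestamp> shape, while concrete backends may override it.
println(sc.applicationId)
```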
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#default-parallelism","title":"Default Parallelism
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          defaultParallelism(): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default parallelism, i.e. a hint for the number of tasks in stages while sizing jobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested for the default parallelism
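As a point of reference (not part of the SchedulerBackend contract itself), the same hint surfaces on SparkContext as defaultParallelism and is what RDD APIs fall back to when no partition count is given; a minimal sketch, assuming an already-created SparkContext `sc`:

```scala
val hint = sc.defaultParallelism    // backed by the scheduler backend's defaultParallelism()
val rdd  = sc.parallelize(1 to 100) // uses defaultParallelism partitions when numSlices is omitted
println(s"defaultParallelism = $hint, rdd partitions = ${rdd.getNumPartitions}")
```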
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#getdriverattributes","title":"getDriverAttributes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getDriverAttributes: Option[Map[String, String]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: None

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext is requested to postApplicationStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#getdriverlogurls","title":"getDriverLogUrls
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getDriverLogUrls: Option[Map[String, String]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Driver log URLs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: None (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext is requested to postApplicationStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#isready","title":"isReady
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          isReady(): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Controls whether this SchedulerBackend is ready (true) or not (false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to wait until scheduling backend is ready
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#killing-task","title":"Killing Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          killTask(\n  taskId: Long,\n  executorId: String,\n  interruptThread: Boolean,\n  reason: String): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Kills a given task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: UnsupportedOperationException

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to killTaskAttempt and killAllTaskAttempts
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSetManager is requested to handle a successful task attempt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#maxNumConcurrentTasks","title":"Maximum Number of Concurrent Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maxNumConcurrentTasks(\n  rp: ResourceProfile): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Maximum number of concurrent tasks that can be launched (based on the given ResourceProfile)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          See:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • CoarseGrainedSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • LocalSchedulerBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkContext is requested for the maximum number of concurrent tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#reviveoffers","title":"reviveOffers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          reviveOffers(): Unit\n

Handles resource allocation offers (from the scheduling system). A simplified sketch of this offer/revive handshake follows the list below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when TaskSchedulerImpl is requested to:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Submit tasks (from a TaskSet)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Handle a task status update

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Notify the TaskSetManager that a task has failed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Check for speculatable tasks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Handle a lost executor event
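
The handshake described above can be illustrated with a small, self-contained sketch. ToyScheduler, ToyBackend and WorkerOfferLike are hypothetical stand-ins written for this page only; they are not Spark's internal classes, which are far richer.

```scala
// Hypothetical stand-ins (NOT Spark's classes) that model the revive/offer handshake:
// the scheduler queues work and asks the backend to revive offers; the backend answers
// with its currently free resources; the scheduler matches pending work to those offers.
final case class WorkerOfferLike(executorId: String, freeCores: Int)

final class ToyBackend(freeResources: Seq[WorkerOfferLike]) {
  def reviveOffers(scheduler: ToyScheduler): Unit =
    scheduler.resourceOffers(freeResources).foreach {
      case (task, executor) => println(s"launching $task on $executor")
    }
}

final class ToyScheduler(backend: ToyBackend) {
  private var pending = Vector.empty[String]

  // Mirrors the pattern of TaskSchedulerImpl.submitTasks: queue the work, then revive offers.
  def submitTasks(tasks: Seq[String]): Unit = {
    pending = pending ++ tasks
    backend.reviveOffers(this)
  }

  // Called back by the backend with whatever slots are currently free.
  def resourceOffers(offers: Seq[WorkerOfferLike]): Seq[(String, String)] = {
    val slots = offers.flatMap(o => Seq.fill(o.freeCores)(o.executorId))
    val launches = pending.zip(slots)
    pending = pending.drop(launches.size)
    launches
  }
}

// Usage:
// val backend = new ToyBackend(Seq(WorkerOfferLike("exec-1", 2)))
// new ToyScheduler(backend).submitTasks(Seq("task-0", "task-1", "task-2"))
```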

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#starting-schedulerbackend","title":"Starting SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          start(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Starts this SchedulerBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to start
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#stop","title":"stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Stops this SchedulerBackend

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskSchedulerImpl is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackend/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • CoarseGrainedSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • LocalSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MesosFineGrainedSchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/SchedulerBackendUtils/","title":"SchedulerBackendUtils Utility","text":""},{"location":"scheduler/SchedulerBackendUtils/#default-number-of-executors","title":"Default Number of Executors

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SchedulerBackendUtils defaults to 2 as the default number of executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulerBackendUtils/#getinitialtargetexecutornumber","title":"getInitialTargetExecutorNumber
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getInitialTargetExecutorNumber(\n  conf: SparkConf,\n  numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getInitialTargetExecutorNumber branches off based on whether Dynamic Allocation of Executors is enabled or not.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With no Dynamic Allocation of Executors, getInitialTargetExecutorNumber uses the spark.executor.instances configuration property (if defined) or uses the given numExecutors (and the DEFAULT_NUMBER_EXECUTORS).

With Dynamic Allocation of Executors enabled, getInitialTargetExecutorNumber uses getDynamicAllocationInitialExecutors and makes sure that the value is between the following configuration properties (as sketched below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • spark.dynamicAllocation.minExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • spark.dynamicAllocation.maxExecutors
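
The sketch below is a simplification of the behaviour described above: a plain Map stands in for SparkConf, and getDynamicAllocationInitialExecutors is approximated by a clamp between the minimum and maximum, so treat it as an approximation rather than Spark's exact code.

```scala
// Simplified sketch of getInitialTargetExecutorNumber; `conf` is a plain Map standing in
// for SparkConf. The property names are the real Spark configuration keys, but the
// dynamic-allocation branch is an approximation of getDynamicAllocationInitialExecutors.
val DEFAULT_NUMBER_EXECUTORS = 2

def initialTargetExecutorNumber(
    conf: Map[String, String],
    numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int = {
  val dynamicAllocation = conf.get("spark.dynamicAllocation.enabled").exists(_.toBoolean)
  if (dynamicAllocation) {
    val min = conf.get("spark.dynamicAllocation.minExecutors").map(_.toInt).getOrElse(0)
    val max = conf.get("spark.dynamicAllocation.maxExecutors").map(_.toInt).getOrElse(Int.MaxValue)
    val initial = conf.get("spark.dynamicAllocation.initialExecutors").map(_.toInt).getOrElse(min)
    math.min(math.max(initial, min), max)  // keep the initial value within [min, max]
  } else {
    conf.get("spark.executor.instances").map(_.toInt).getOrElse(numExecutors)
  }
}
```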

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getInitialTargetExecutorNumber is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Spark on YARN's YarnAllocator, YarnClientSchedulerBackend and YarnClusterSchedulerBackend are used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/SchedulingMode/","title":"SchedulingMode","text":"

Scheduling Mode (spark.scheduler.mode Spark Property)

Scheduling Mode (aka order task policy, scheduling policy, or scheduling order) defines the policy used to sort tasks for execution.

The schedulingMode attribute is part of the TaskScheduler contract.

The only implementation of the TaskScheduler contract in Spark, TaskSchedulerImpl, uses the spark.scheduler.mode configuration property to configure schedulingMode, which is merely used to set up the rootPool attribute (with FIFO being the default). That happens when TaskSchedulerImpl is initialized.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          There are three acceptable scheduling modes:

• FIFO with no pools but a single top-level unnamed pool whose elements are TaskSetManager objects; a lower priority gets scheduled sooner, or the earlier stage wins.
• FAIR with a hierarchy of Schedulable (sub)pools with the rootPool at the top.
• NONE (not used)

NOTE: Out of the three possible SchedulingMode policies, only FIFO and FAIR modes are supported by TaskSchedulerImpl.
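
Since the scheduling mode is driven entirely by spark.scheduler.mode, switching it is a configuration change. A minimal example (the application name and master below are arbitrary choices, not anything this page prescribes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FIFO is the default; set spark.scheduler.mode to FAIR to use the fair scheduler.
val conf = new SparkConf()
  .setAppName("scheduling-mode-demo") // arbitrary name
  .setMaster("local[*]")              // arbitrary master for a local run
  .set("spark.scheduler.mode", "FAIR")

val sc = SparkContext.getOrCreate(conf)
```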

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/SchedulingMode/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          After the root pool is initialized, the scheduling mode is no longer relevant (since the spark-scheduler-Schedulable.md[Schedulable] that represents the root pool is fully set up).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/SchedulingMode/#the-root-pool-is-later-used-when-schedulertaskschedulerimplmdsubmittaskstaskschedulerimpl-submits-tasks-as-tasksets-for-execution","title":"The root pool is later used when scheduler:TaskSchedulerImpl.md#submitTasks[TaskSchedulerImpl submits tasks (as TaskSets) for execution].","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: The scheduler:TaskScheduler.md#rootPool[root pool] is a Schedulable. Refer to spark-scheduler-Schedulable.md[Schedulable].

Monitoring FAIR Scheduling Mode using Spark UI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CAUTION: FIXME Describe me...
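
Until this section is described, a hedged sketch of what to look at (reusing the sc from the previous example, with FAIR mode already enabled): jobs can be assigned to a named pool through the spark.scheduler.pool local property, and the pools then appear in the Stages tab of the web UI.

```scala
// Assign jobs submitted from this thread to a named pool (the pool name is arbitrary);
// with spark.scheduler.mode=FAIR the pool shows up in the web UI's Stages tab.
sc.setLocalProperty("spark.scheduler.pool", "reporting")

sc.parallelize(1 to 1000).count()  // runs in the "reporting" pool

// Clear the property to fall back to the default pool for subsequent jobs.
sc.setLocalProperty("spark.scheduler.pool", null)
```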

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/ShuffleMapStage/","title":"ShuffleMapStage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMapStage (shuffle map stage or simply map stage) is a Stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMapStage corresponds to (and is associated with) a ShuffleDependency.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMapStage can be submitted independently but it is usually an intermediate step in a physical execution plan (with the final step being a ResultStage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"scheduler/ShuffleMapStage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMapStage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • RDD (of the ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Number of tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Parent Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • First Job ID (of the ActiveJob that created it)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • CallSite
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MapOutputTrackerMaster
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Resource Profile ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapStage is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to plan a ShuffleDependency for execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ShuffleMapStage/#missing-partitions","title":"Missing Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            findMissingPartitions(): Seq[Int]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            findMissingPartitions requests the MapOutputTrackerMaster for the missing partitions (of the ShuffleDependency) and returns them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If not available (MapOutputTrackerMaster does not track the ShuffleDependency), findMissingPartitions simply assumes that all the partitions are missing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            findMissingPartitions is part of the Stage abstraction.
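
A minimal sketch of that fallback (a hypothetical helper, not Spark's actual code): if the tracker reports nothing for the shuffle, every partition is treated as missing.

```scala
// Hypothetical helper mirroring the fallback described above: use what the tracker
// reports if it tracks this shuffle, otherwise assume all partitions are missing.
def missingPartitions(trackedMissing: Option[Seq[Int]], numPartitions: Int): Seq[Int] =
  trackedMissing.getOrElse(0 until numPartitions)
```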

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#shufflemapstage-ready","title":"ShuffleMapStage Ready

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When \"executed\", a ShuffleMapStage saves map output files (for reduce tasks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When all partitions have shuffle map outputs available, ShuffleMapStage is considered ready (done or available).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#isavailable","title":"isAvailable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isAvailable: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isAvailable is true when the ShuffleMapStage is ready and all partitions have shuffle outputs (i.e. the numAvailableOutputs is exactly the numPartitions).
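
In other words (a one-line sketch using the names from the sentence above, not Spark's source verbatim):

```scala
// Ready ("available") exactly when every partition has a registered map output.
def isAvailable(numAvailableOutputs: Int, numPartitions: Int): Boolean =
  numAvailableOutputs == numPartitions
```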

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isAvailable is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to getMissingParentStages, handleMapStageSubmitted, submitMissingTasks, processShuffleMapStageCompletion, markMapStageJobsAsFinished and stageDependsOn
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#available-outputs","title":"Available Outputs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            numAvailableOutputs: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            numAvailableOutputs requests the MapOutputTrackerMaster to getNumAvailableOutputs (for the shuffleId of the ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            numAvailableOutputs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to submitMissingTasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleMapStage is requested to isAvailable
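
The following is a minimal Scala sketch (not the Spark source) of how numAvailableOutputs and isAvailable relate: the stage delegates the count of available map outputs to a tracker and compares it with the number of partitions. The MapOutputTrackerSketch trait and all names below are simplified stand-ins for illustration only.

// Simplified stand-in for MapOutputTrackerMaster (illustration only)
trait MapOutputTrackerSketch {
  def getNumAvailableOutputs(shuffleId: Int): Int
}

// Minimal sketch of the availability check described above
class ShuffleMapStageSketch(
    shuffleId: Int,
    numPartitions: Int,
    tracker: MapOutputTrackerSketch) {

  // delegates to the tracker for the shuffle of this stage
  def numAvailableOutputs: Int = tracker.getNumAvailableOutputs(shuffleId)

  // ready when every partition has a map output available
  def isAvailable: Boolean = numAvailableOutputs == numPartitions
}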
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#active-jobs","title":"Active Jobs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapStage defines _mapStageJobs internal registry of ActiveJobs to track jobs that were submitted to execute the stage independently.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A new job is registered (added) in addActiveJob.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            An active job is deregistered (removed) in removeActiveJob.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#addactivejob","title":"addActiveJob
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            addActiveJob(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            addActiveJob adds the given ActiveJob to (the front of) the _mapStageJobs list.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            addActiveJob is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to handleMapStageSubmitted
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#removeactivejob","title":"removeActiveJob
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeActiveJob(\n  job: ActiveJob): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeActiveJob removes the ActiveJob from the _mapStageJobs registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            removeActiveJob is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to cleanupStateForJobAndIndependentStages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#mapstagejobs","title":"mapStageJobs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            mapStageJobs: Seq[ActiveJob]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            mapStageJobs returns the _mapStageJobs list.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            mapStageJobs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested to markMapStageJobsAsFinished
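
A minimal sketch of the _mapStageJobs registry behaviour described in this section: addActiveJob prepends to the list, removeActiveJob filters the job out, and mapStageJobs exposes the list. ActiveJobStub is a hypothetical stand-in for ActiveJob.

final case class ActiveJobStub(jobId: Int)  // hypothetical stand-in for ActiveJob

class MapStageJobsRegistrySketch {
  private var _mapStageJobs: List[ActiveJobStub] = Nil

  // addActiveJob: adds the job to the front of the list
  def addActiveJob(job: ActiveJobStub): Unit =
    _mapStageJobs = job :: _mapStageJobs

  // removeActiveJob: removes the job from the registry
  def removeActiveJob(job: ActiveJobStub): Unit =
    _mapStageJobs = _mapStageJobs.filterNot(_ == job)

  // mapStageJobs: returns the registered jobs
  def mapStageJobs: Seq[ActiveJobStub] = _mapStageJobs
}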
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#demo-shufflemapstage-sharing","title":"Demo: ShuffleMapStage Sharing

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A ShuffleMapStage can be shared across multiple jobs (if these jobs reuse the same RDDs).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            val keyValuePairs = sc.parallelize(0 to 5).map((_, 1))\nval rdd = keyValuePairs.sortByKey()  // (1)\n\nscala> println(rdd.toDebugString)\n(6) ShuffledRDD[4] at sortByKey at <console>:39 []\n +-(16) MapPartitionsRDD[1] at map at <console>:39 []\n    |   ParallelCollectionRDD[0] at parallelize at <console>:39 []\n\nrdd.count  // (2)\nrdd.count  // (3)\n
1. Shuffle at sortByKey()
2. Submits a job with two stages (both to be executed)
3. Intentionally repeats the last action, submitting a new job with two stages, one of which is shared because it is already computed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapStage/#map-output-files","title":"Map Output Files

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapStage writes out map output files (for a shuffle).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"scheduler/ShuffleMapTask/","title":"ShuffleMapTask","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapTask is a Task to produce a MapStatus (Task[MapStatus]).

ShuffleMapTask is one of the two types of Tasks. When executed, ShuffleMapTask writes the result of executing a serialized task code over the records of an RDD partition to the shuffle system and returns a MapStatus (with the BlockManager and the estimated size of the result shuffle blocks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/ShuffleMapTask/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ShuffleMapTask takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Stage Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Broadcast variable with a serialized task binary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Partition
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskLocations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Local Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Serialized task metrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Job ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Application ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Application Attempt ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isBarrier flag
ShuffleMapTask is created when DAGScheduler is requested to submit tasks for all missing partitions of a ShuffleMapStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/ShuffleMapTask/#isBarrier","title":"isBarrier Flag","text":"

ShuffleMapTask can be given the isBarrier flag when created. Unless given, isBarrier is assumed disabled (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isBarrier flag is passed to the parent Task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/ShuffleMapTask/#serialized-task-binary","title":"Serialized Task Binary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              taskBinary: Broadcast[Array[Byte]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ShuffleMapTask is given a broadcast variable with a reference to a serialized task binary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask expects that the serialized task binary is a tuple of an RDD and a ShuffleDependency.
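
As an illustration only (not Spark code), the round trip below shows the idea of packing a pair into a byte array and unpacking it back, analogous to how the task binary carries a serialized (RDD, ShuffleDependency) pair; plain strings stand in for the RDD and the dependency, and Java serialization stands in for the closure serializer.

import java.io._

object TaskBinaryRoundTripSketch {
  def serialize(value: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    bytes.toByteArray
  }

  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // stand-ins: a pair of strings instead of (RDD, ShuffleDependency)
    val taskBinaryLike = serialize(("rdd", "shuffleDependency"))
    val (rdd, dep) = deserialize[(String, String)](taskBinaryLike)
    println(s"$rdd -> $dep")
  }
}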

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/ShuffleMapTask/#preferred-locations","title":"Preferred Locations Signature
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              preferredLocations: Seq[TaskLocation]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              preferredLocations is part of the Task abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              preferredLocations returns preferredLocs internal property.

ShuffleMapTask tracks the given locs as unique TaskLocations. When locs is not defined, preferredLocs is empty and no task location preferences are defined.

ShuffleMapTask initializes the preferredLocs internal property when created.
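
A minimal sketch of the preferred-locations handling described above (TaskLocationStub is a hypothetical stand-in for TaskLocation): keep unique locations, and fall back to no preferences when locs is not given.

final case class TaskLocationStub(host: String)  // hypothetical stand-in for TaskLocation

object PreferredLocsSketch {
  // keep unique locations; no preferences when locs is not given
  def preferredLocs(locs: Seq[TaskLocationStub]): Seq[TaskLocationStub] =
    if (locs == null) Nil else locs.distinct
}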

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/ShuffleMapTask/#running-task","title":"Running Task Signature
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask(\n  context: TaskContext): MapStatus\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask is part of the Task abstraction.

runTask writes the records of the RDD partition (computed by executing the serialized task code) out to the shuffle system and returns a MapStatus (with the BlockManager and an estimated size of the result shuffle blocks).

Internally, runTask requests the SparkEnv for a new instance of the closure serializer and requests it to deserialize the serialized task code (into a tuple of an RDD and a ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask measures the thread and CPU deserialization times.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask requests the SparkEnv for the ShuffleManager and requests it for a ShuffleWriter (for the ShuffleHandle and the partition).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask then requests the RDD for the records (of the partition) that the ShuffleWriter is requested to write out (to the shuffle system).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, runTask requests the ShuffleWriter to stop (with the success flag on) and returns the shuffle map output status.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

This is the moment in the lifecycle of a Task (and its corresponding RDD) when an RDD partition is computed and in turn becomes a sequence of records (i.e. real data) on an executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In case of any exceptions, runTask requests the ShuffleWriter to stop (with the success flag off) and (re)throws the exception.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              runTask may also print out the following DEBUG message to the logs when the ShuffleWriter could not be stopped.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Could not stop writer\n
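
The control flow of runTask described above can be sketched as follows (simplified stand-ins, not the actual ShuffleWriter or MapStatus APIs): write the partition's records, stop the writer with the success flag on and return the map output status, and on failure stop the writer with the success flag off and rethrow the exception.

// Simplified stand-ins (illustration only), not the actual Spark APIs
trait MapStatusSketch
trait ShuffleWriterSketch {
  def write(records: Iterator[Any]): Unit
  def stop(success: Boolean): Option[MapStatusSketch]
}

object RunTaskSketch {
  def runTask(records: Iterator[Any], writer: ShuffleWriterSketch): MapStatusSketch =
    try {
      // write the records of the RDD partition to the shuffle system
      writer.write(records)
      // stop the writer with the success flag on and return the map output status
      writer.stop(success = true).get
    } catch {
      case e: Throwable =>
        // stop the writer with the success flag off and rethrow the exception
        try writer.stop(success = false)
        catch { case _: Throwable => () }  // "Could not stop writer"
        throw e
    }
}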
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/ShuffleMapTask/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.scheduler.ShuffleMapTask logger to see what happens inside.

Add the following lines to conf/log4j2.properties:

logger.ShuffleMapTask.name = org.apache.spark.scheduler.ShuffleMapTask
logger.ShuffleMapTask.level = all

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/ShuffleStatus/","title":"ShuffleStatus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ShuffleStatus is a registry of MapStatuses per Partition of a ShuffleMapStage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ShuffleStatus is used by MapOutputTrackerMaster.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"scheduler/ShuffleStatus/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ShuffleStatus takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Number of Partitions (of the RDD of the ShuffleDependency of a ShuffleMapStage)

ShuffleStatus is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster is requested to register a shuffle (when DAGScheduler is requested to create a ShuffleMapStage)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/ShuffleStatus/#mapstatuses-per-partition","title":"MapStatuses per Partition

ShuffleStatus creates the mapStatuses internal registry of MapStatuses (one entry per partition, based on the numPartitions) when created.

A partition is missing when it has no MapStatus (null at the index of the partition ID); missing partitions can be requested using findMissingPartitions.

Initially, mapStatuses is all null (for every partition), so all partitions are missing (uncomputed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A new MapStatus is added in addMapOutput and updateMapOutput.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A MapStatus is removed (nulled) in removeMapOutput and removeOutputsByFilter.

The number of available MapStatuses is tracked by the _numAvailableMapOutputs internal counter. A simplified model of this registry is sketched below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • serializedMapStatus and withMapStatuses
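
Here is a simplified model of that registry (an illustration under assumptions, not the actual ShuffleStatus sources; MapStatusRegistry is a hypothetical name and AnyRef stands in for MapStatus):

```scala
// Simplified model of the mapStatuses registry: one slot per partition,
// null meaning the partition's map output has not been registered yet.
class MapStatusRegistry(val numPartitions: Int) {
  // AnyRef stands in for org.apache.spark.scheduler.MapStatus
  protected val mapStatuses = new Array[AnyRef](numPartitions)
  protected var numAvailableMapOutputs = 0

  // IDs of the partitions with no MapStatus registered (all of them initially)
  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filter(i => mapStatuses(i) == null)
}
```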
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleStatus/#registering-shuffle-map-output","title":"Registering Shuffle Map Output
addMapOutput(
  mapIndex: Int,
  status: MapStatus): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addMapOutput adds the MapStatus to the mapStatuses internal registry.

If the mapStatuses internal registry had no MapStatus for the given mapIndex yet, addMapOutput increments the _numAvailableMapOutputs internal counter and invalidates the cached serialized map output statuses (invalidateSerializedMapOutputStatusCache), as in the sketch below.

addMapOutput is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster is requested to registerMapOutput
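
A sketch of that behavior, expressed as a method added to the simplified MapStatusRegistry shown earlier (an illustration only; the cache invalidation is reduced to a comment):

```scala
// Register a MapStatus for a map index; bump the counter only when the slot
// was previously empty, which is when the real code also invalidates the
// cached serialized map output statuses.
class UpdatableMapStatusRegistry(numPartitions: Int)
    extends MapStatusRegistry(numPartitions) {

  def addMapOutput(mapIndex: Int, status: AnyRef): Unit = {
    val alreadyRegistered = mapStatuses(mapIndex) != null
    mapStatuses(mapIndex) = status
    if (!alreadyRegistered) {
      numAvailableMapOutputs += 1
      // invalidateSerializedMapOutputStatusCache() would be called here
    }
  }
}
```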
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleStatus/#deregistering-shuffle-map-output","title":"Deregistering Shuffle Map Output
removeMapOutput(
  mapIndex: Int,
  bmAddress: BlockManagerId): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                removeMapOutput...FIXME

removeMapOutput is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster is requested to unregisterMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleStatus/#missing-partitions","title":"Missing Partitions
findMissingPartitions(): Seq[Int]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                findMissingPartitions...FIXME

findMissingPartitions is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MapOutputTrackerMaster is requested to findMissingPartitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleStatus/#serializing-shuffle-map-output-statuses","title":"Serializing Shuffle Map Output Statuses
serializedMapStatus(
  broadcastManager: BroadcastManager,
  isLocal: Boolean,
  minBroadcastSize: Int,
  conf: SparkConf): Array[Byte]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                serializedMapStatus...FIXME

serializedMapStatus is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MessageLoop (of the MapOutputTrackerMaster) is requested to send map output locations for shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/ShuffleStatus/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.ShuffleStatus logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.ShuffleStatus=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/Stage/","title":"Stage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Stage is an abstraction of steps in a physical execution plan.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The logical DAG or logical execution plan is the RDD lineage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Indirectly, a Stage is a set of parallel tasks - one task per partition (of an RDD that computes partial results of a function executed as part of a Spark job).

In other words, a Spark job is a computation "sliced" (not to use the reserved term partitioned) into stages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/Stage/#contract","title":"Contract","text":""},{"location":"scheduler/Stage/#missing-partitions","title":"Missing Partitions
findMissingPartitions(): Seq[Int]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Missing partitions (IDs of the partitions of the RDD that are missing and need to be computed)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested to submit missing tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/Stage/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResultStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleMapStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/Stage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Stage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • RDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Number of tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Parent Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • First Job ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CallSite
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Resource Profile ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Abstract Class

Stage is an abstract class and cannot be created directly. It is created indirectly through its concrete implementations (ResultStage and ShuffleMapStage). A simplified sketch of the abstraction follows.
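
This sketch is an illustration based on the constructor parameters and contract listed above, not the actual Spark sources; the RDD and CallSite parameters are reduced to placeholder types:

```scala
// Simplified model of the Stage abstraction.
abstract class StageSketch(
    val id: Int,                       // Stage ID
    val rdd: AnyRef,                   // the RDD the stage operates on
    val numTasks: Int,                 // number of tasks
    val parents: List[StageSketch],    // parent stages
    val firstJobId: Int,               // first job ID
    val callSite: String,              // call site (placeholder for CallSite)
    val resourceProfileId: Int) {      // resource profile ID

  // Contract: IDs of the partitions of the RDD that are missing and need to be computed
  def findMissingPartitions(): Seq[Int]
}
```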

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/Stage/#rdd","title":"RDD

Stage is given an RDD when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/Stage/#stage-id","title":"Stage ID

Stage is given a unique ID when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

DAGScheduler uses the nextStageId internal counter to track the number of stage submissions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/Stage/#making-new-stage-attempt","title":"Making New Stage Attempt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  makeNewStageAttempt(\n  numPartitionsToCompute: Int,\n  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  makeNewStageAttempt creates a new TaskMetrics and requests it to register itself with the SparkContext of the RDD.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  makeNewStageAttempt creates a StageInfo from this Stage (and the nextAttemptId). This StageInfo is saved in the _latestInfo internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, makeNewStageAttempt increments the nextAttemptId internal counter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  makeNewStageAttempt returns Unit (nothing) and its purpose is to update the latest StageInfo internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  makeNewStageAttempt\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DAGScheduler is requested to submit the missing tasks of a stage
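
A hedged, simplified sketch of that sequence; all names here (TaskMetricsSketch, AttemptInfoSketch, StageAttemptSketch) are illustrative stand-ins rather than Spark's actual types, and the registration with SparkContext is elided.

// Illustrative stand-ins only; none of these are Spark's actual classes.
case class TaskMetricsSketch(registeredWithSparkContext: Boolean)
case class AttemptInfoSketch(stageId: Int, attemptId: Int, numTasks: Int, metrics: TaskMetricsSketch)

class StageAttemptSketch(val id: Int) {
  private var nextAttemptId = 0
  private var _latestInfo: Option[AttemptInfoSketch] = None

  def latestInfo: Option[AttemptInfoSketch] = _latestInfo

  // Mirrors the sequence described above: create task metrics, build the info
  // for the next attempt, remember it in _latestInfo, then bump the counter.
  def makeNewStageAttempt(numPartitionsToCompute: Int): Unit = {
    val metrics = TaskMetricsSketch(registeredWithSparkContext = true)
    _latestInfo = Some(AttemptInfoSketch(id, nextAttemptId, numPartitionsToCompute, metrics))
    nextAttemptId += 1
  }
}
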
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/StageInfo/","title":"StageInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StageInfo is a metadata about a stage to pass from the scheduler to SparkListeners.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/StageInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StageInfo takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stage Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Number of Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDDInfos
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Parent IDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Details
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskMetrics (default: null)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Task Locality Preferences (default: empty)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Optional Shuffle Dependency ID (default: undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    StageInfo is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • StageInfo utility is used to fromStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JsonProtocol (History Server) is used to stageInfoFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/StageInfo/#fromstage-utility","title":"fromStage Utility
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fromStage(\n  stage: Stage,\n  attemptId: Int,\n  numTasks: Option[Int] = None,\n  taskMetrics: TaskMetrics = null,\n  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): StageInfo\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fromStage...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fromStage\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage is created and make a new Stage attempt
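
A simplified sketch of a fromStage-like factory built from the parameters listed above; the Sketch-suffixed types and the flattened parameter list are illustrative assumptions, not Spark's actual signature.

// Illustrative stand-ins; Spark's StageInfo and RDDInfo carry more fields.
case class RDDInfoSketch(id: Int, name: String)
case class StageInfoSketch(
    stageId: Int,
    attemptId: Int,
    name: String,
    numTasks: Int,
    rddInfos: Seq[RDDInfoSketch],
    parentIds: Seq[Int],
    details: String)

object StageInfoSketch {
  // A fromStage-like factory: package a stage's metadata together with the
  // given attempt ID, falling back to the stage's own task count when
  // numTasks is not specified.
  def fromStage(
      stageId: Int,
      name: String,
      stageNumTasks: Int,
      rddInfos: Seq[RDDInfoSketch],
      parentIds: Seq[Int],
      details: String,
      attemptId: Int,
      numTasks: Option[Int] = None): StageInfoSketch =
    StageInfoSketch(stageId, attemptId, name,
      numTasks.getOrElse(stageNumTasks), rddInfos, parentIds, details)
}
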
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/Task/","title":"Task","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Task is an abstraction of the smallest individual units of execution that can be executed (to compute an RDD partition).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/Task/#contract","title":"Contract","text":""},{"location":"scheduler/Task/#running-task","title":"Running Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    runTask(\n  context: TaskContext): T\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Runs the task (in a TaskContext)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when Task is requested to run

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/Task/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ResultTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleMapTask
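
To make the contract concrete, here is a minimal sketch with stand-in types; the Sketch-suffixed names are hypothetical and only the runTask(context) shape comes from the contract above.

// A stand-in for TaskContext, just enough to show the shape of the contract.
trait TaskContextSketch { def partitionId: Int }

// The contract: an abstract task parameterized by its result type, with
// runTask left to the concrete implementations.
abstract class TaskSketch[T](val stageId: Int, val partitionId: Int) {
  def runTask(context: TaskContextSketch): T
}

// Mirrors a ResultTask-style task: applies a function to its partition and
// returns the result.
class ResultTaskSketch[T](id: Int, partition: Int, body: Int => T)
    extends TaskSketch[T](id, partition) {
  override def runTask(context: TaskContextSketch): T = body(context.partitionId)
}

// Mirrors a ShuffleMapTask-style task: writes shuffle output and returns
// where it can be found (simplified here to a String).
class ShuffleMapTaskSketch(id: Int, partition: Int)
    extends TaskSketch[String](id, partition) {
  override def runTask(context: TaskContextSketch): String =
    s"map output of partition ${context.partitionId}"
}
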
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/Task/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Task takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stage (execution) Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Partition ID to compute
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Local Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Serialized TaskMetrics (Array[Byte])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ActiveJob ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Application ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Application Attempt ID (default: None)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • isBarrier flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Task is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to submit missing tasks of a stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Abstract Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete Tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/Task/#isBarrier","title":"isBarrier Flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task can be given isBarrier flag when created. Unless given, isBarrier is assumed disabled (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      isBarrier flag indicates whether this Task belongs to a Barrier Stage in Barrier Execution Mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      isBarrier flag is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DAGScheduler is requested to handleTaskCompletion (of a FetchFailed task) to fail the parent stage (and retry a barrier stage when one of the barrier tasks fails)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Task is requested to run (to create a BarrierTaskContext)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSetManager is requested to isBarrier and handleFailedTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#taskmemorymanager","title":"TaskMemoryManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task is given a TaskMemoryManager when TaskRunner is requested to run a task (right after deserializing the task for execution).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task uses the TaskMemoryManager to create a TaskContextImpl (when requested to run).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task is a Serializable (Java) so it can be serialized (to bytes) and send over the wire for execution from the driver to executors.
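
As a generic illustration of that round trip (not Spark's actual serialization path, which uses its own serializer machinery), plain Java serialization of a Serializable stand-in class looks like this; SerializableTaskSketch and TaskWireRoundTrip are hypothetical names.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A stand-in, Serializable "task" used only to illustrate the round trip.
case class SerializableTaskSketch(stageId: Int, partitionId: Int)

object TaskWireRoundTrip {
  def main(args: Array[String]): Unit = {
    val bytesOut = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytesOut)
    oos.writeObject(SerializableTaskSketch(0, 3))
    oos.close()
    val wireBytes = bytesOut.toByteArray // what would travel from the driver to an executor

    val restored = new ObjectInputStream(new ByteArrayInputStream(wireBytes))
      .readObject().asInstanceOf[SerializableTaskSketch]
    println(restored) // SerializableTaskSketch(0,3)
  }
}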

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#preferred-locations","title":"Preferred Locations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      preferredLocations: Seq[TaskLocation]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskLocations that represent preferred locations (executors) to execute the task on.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Empty by default and so no task location preferences are defined that says the task could be launched on any executor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Defined by the concrete tasks (i.e. ShuffleMapTask and ResultTask).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      preferredLocations is used when TaskSetManager is requested to register a task as pending execution and dequeueSpeculativeTask.
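
A hedged sketch of how a concrete task could expose such preferences; the Sketch-suffixed names are illustrative and the exact behaviour of ShuffleMapTask and ResultTask may differ.

// A stand-in for TaskLocation: a host and, optionally, a specific executor.
case class TaskLocationSketch(host: String, executorId: Option[String] = None)

// Default: no preferences, so the task may be launched on any executor.
abstract class LocatableTaskSketch {
  def preferredLocations: Seq[TaskLocationSketch] = Nil
}

// A concrete task that simply exposes the (deduplicated) locations it was
// given when created, overriding the empty default.
class LocatedTaskSketch(locs: Seq[TaskLocationSketch]) extends LocatableTaskSketch {
  override def preferredLocations: Seq[TaskLocationSketch] =
    Option(locs).map(_.distinct).getOrElse(Nil)
}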

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#run","title":"Running Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run(\n  taskAttemptId: Long,\n  attemptNumber: Int,\n  metricsSystem: MetricsSystem,\n  resources: Map[String, ResourceInformation],\n  plugins: Option[PluginContainer]): T\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run registers the task (attempt) with the BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a TaskContextImpl (and perhaps a BarrierTaskContext too when the given isBarrier flag is enabled) that in turn becomes the task's TaskContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run checks _killed flag and, if enabled, kills the task (with interruptThread flag disabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run creates a Hadoop CallerContext and sets it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run informs the given PluginContainer that the task is started.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run runs the task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      This is the moment when the custom Task's runTask is executed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, run notifies TaskContextImpl that the task has completed (regardless of the final outcome -- a success or a failure).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In case of any exceptions, run notifies TaskContextImpl that the task has failed. run requests MemoryStore to release unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run uses SparkEnv to access the current BlockManager that it uses to access MemoryStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run requests MemoryManager to notify any tasks waiting for execution memory to be freed to wake up and try to acquire memory again.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run unsets the task's TaskContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run uses SparkEnv to access the current MemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      run is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskRunner is requested to run (when Executor is requested to launch a task (on \"Executor task launch worker\" thread pool sometime in the future))
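The control flow described above can be summarized with a much-simplified, hypothetical sketch (not Spark's actual code); body, onFailure and cleanUp stand in for runTask, the TaskContextImpl notifications and the memory/TaskContext clean-up, respectively:

def runSketch[T](\n    body: () => T)(\n    onFailure: Throwable => Unit)(\n    cleanUp: () => Unit): T = {\n  try {\n    body()         // stands in for runTask(context)\n  } catch {\n    case t: Throwable =>\n      onFailure(t) // \"notifies TaskContextImpl that the task has failed\"\n      throw t\n  } finally {\n    cleanUp()      // mark completed, release unroll memory, notify waiters, unset TaskContext\n  }\n}\n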
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#task-states","title":"Task States

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task can be in one of the following states (as described by TaskState enumeration):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • LAUNCHING
• RUNNING when the task is being executed.
• FINISHED when the task has finished (successfully, with a serialized result).
• FAILED when the task fails, e.g. when a FetchFailedException, a CommitDeniedException or any other Throwable is thrown.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • KILLED when an executor kills a task.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • LOST

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      States are the values of org.apache.spark.TaskState.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Task status updates are sent from executors to the driver through ExecutorBackend.

A task is finished when it is in one of the FINISHED, FAILED, KILLED or LOST states.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      LOST and FAILED states are considered failures.
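The finished/failed classification above can be mirrored with a small, hypothetical enumeration (a stand-in for illustration only; org.apache.spark.TaskState itself is private[spark]):

object TaskStateSketch extends Enumeration {\n  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value\n\n  private val FINISHED_STATES = Set(FINISHED, FAILED, KILLED, LOST)\n\n  def isFinished(state: Value): Boolean = FINISHED_STATES(state)\n  def isFailed(state: Value): Boolean = (state == LOST) || (state == FAILED)\n}\n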

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#collecting-latest-values-of-accumulators","title":"Collecting Latest Values of Accumulators
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates(\n  taskFailed: Boolean = false): Seq[AccumulableInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates collects the latest values of internal and external accumulators from a task (and returns the values as a collection of AccumulableInfo).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Internally, collectAccumulatorUpdates takes TaskMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates uses TaskContextImpl to access the task's TaskMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates collects the latest values of:

• internal accumulators whose current value is not the zero value, plus the RESULT_SIZE accumulator (regardless of whether its value is zero or not).

• external accumulators when taskFailed is disabled (false), or those external accumulators that should also be reported on failures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates returns an empty collection when TaskContextImpl is not initialized.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      collectAccumulatorUpdates is used when TaskRunner runs a task (and sends a task's final results back to the driver).
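The selection described above can be sketched as follows (the Acc record and its fields are made up for illustration; Spark works on the AccumulatorV2 instances of TaskMetrics and the externally registered accumulators):

case class Acc(\n  internal: Boolean,\n  isZero: Boolean,\n  name: Option[String],\n  reportOnFailures: Boolean)\n\ndef collectSketch(accs: Seq[Acc], taskFailed: Boolean): Seq[Acc] = {\n  val internal = accs.filter(a =>\n    a.internal && (!a.isZero || a.name.contains(\"internal.metrics.resultSize\")))\n  val external = accs.filter(a =>\n    !a.internal && (!taskFailed || a.reportOnFailures))\n  internal ++ external\n}\n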

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/Task/#killing-task","title":"Killing Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      kill(\n  interruptThread: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      kill marks the task to be killed, i.e. it sets the internal _killed flag to true.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      kill calls TaskContextImpl.markInterrupted when context is set.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If interruptThread is enabled and the internal taskThread is available, kill interrupts it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      CAUTION: FIXME When could context and interruptThread not be set?
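For reference, a task attempt can also be requested to be killed from the driver with the public SparkContext.killTaskAttempt API (which goes through the scheduler and ends up killing the task on the executor); the task attempt ID below is a placeholder that you would normally take from the web UI or a SparkListener:

val taskAttemptId = 123L  // placeholder task attempt ID\nsc.killTaskAttempt(taskAttemptId, interruptThread = true, reason = \"killed for demonstration\")\n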

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/","title":"TaskContext","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskContext is an abstraction of task contexts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskContext/#contract-subset","title":"Contract (Subset)","text":""},{"location":"scheduler/TaskContext/#addtaskcompletionlistener","title":"addTaskCompletionListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      addTaskCompletionListener[U](\n  f: (TaskContext) => U): TaskContext\naddTaskCompletionListener(\n  listener: TaskCompletionListener): TaskContext\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Registers a TaskCompletionListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      val rdd = sc.range(0, 5, numSlices = 1)\n\nimport org.apache.spark.TaskContext\nval printTaskInfo = (tc: TaskContext) => {\n  val msg = s\"\"\"|-------------------\n                |partitionId:   ${tc.partitionId}\n                |stageId:       ${tc.stageId}\n                |attemptNum:    ${tc.attemptNumber}\n                |taskAttemptId: ${tc.taskAttemptId}\n                |-------------------\"\"\".stripMargin\n  println(msg)\n}\n\nrdd.foreachPartition { _ =>\n  val tc = TaskContext.get\n  tc.addTaskCompletionListener(printTaskInfo)\n}\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#addtaskfailurelistener","title":"addTaskFailureListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      addTaskFailureListener(\n  f: (TaskContext, Throwable) => Unit): TaskContext\naddTaskFailureListener(\n  listener: TaskFailureListener): TaskContext\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Registers a TaskFailureListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      val rdd = sc.range(0, 2, numSlices = 2)\n\nimport org.apache.spark.TaskContext\nval printTaskErrorInfo = (tc: TaskContext, error: Throwable) => {\n  val msg = s\"\"\"|-------------------\n                |partitionId:   ${tc.partitionId}\n                |stageId:       ${tc.stageId}\n                |attemptNum:    ${tc.attemptNumber}\n                |taskAttemptId: ${tc.taskAttemptId}\n                |error:         ${error.toString}\n                |-------------------\"\"\".stripMargin\n  println(msg)\n}\n\nval throwExceptionForOddNumber = (n: Long) => {\n  if (n % 2 == 1) {\n    throw new Exception(s\"No way it will pass for odd number: $n\")\n  }\n}\n\n// FIXME It won't work.\nrdd.map(throwExceptionForOddNumber).foreachPartition { _ =>\n  val tc = TaskContext.get\n  tc.addTaskFailureListener(printTaskErrorInfo)\n}\n\n// Listener registration matters.\nrdd.mapPartitions { (it: Iterator[Long]) =>\n  val tc = TaskContext.get\n  tc.addTaskFailureListener(printTaskErrorInfo)\n  it\n}.map(throwExceptionForOddNumber).count\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#fetchfailed","title":"fetchFailed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      fetchFailed: Option[FetchFailedException]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskRunner is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#getkillreason","title":"getKillReason
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getKillReason(): Option[String]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#getlocalproperty","title":"getLocalProperty
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocalProperty(\n  key: String): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Looks up a local property by key
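For example (the mySparkApp.stage key is made up), local properties set on the driver are propagated to the tasks of jobs submitted from the same thread and can be looked up with getLocalProperty:

sc.setLocalProperty(\"mySparkApp.stage\", \"ingest\")\n\nsc.range(0, 2, numSlices = 2).foreachPartition { _ =>\n  val tc = org.apache.spark.TaskContext.get\n  println(\"mySparkApp.stage = \" + tc.getLocalProperty(\"mySparkApp.stage\"))\n}\n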

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#getmetricssources","title":"getMetricsSources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMetricsSources(\n  sourceName: String): Seq[Source]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Looks up Sources by name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#registering-accumulator","title":"Registering Accumulator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      registerAccumulator(\n  a: AccumulatorV2[_, _]): Unit\n

Registers an AccumulatorV2

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AccumulatorV2 is requested to deserialize itself
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#resources","title":"Resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      resources(): Map[String, ResourceInformation]\n

Resources allocated to this task (keyed by resource name)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      See:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskContextImpl
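
For illustration, a minimal sketch of reading the allocated resources from inside a task. It assumes the application was submitted with spark.executor.resource.gpu.amount and spark.task.resource.gpu.amount set, so every task is allocated a gpu resource; without such configuration the map is simply empty.

import org.apache.spark.TaskContext

// Runs on the executors; every task prints the resource addresses it was allocated.
sc.range(0, 3, numSlices = 3).foreach { _ =>
  val tc = TaskContext.get
  tc.resources().get("gpu") match {
    case Some(info) => println(s"partition ${tc.partitionId()}: GPU addresses ${info.addresses.mkString(", ")}")
    case None       => println(s"partition ${tc.partitionId()}: no GPU allocated")
  }
}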
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#taskmetrics","title":"taskMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      taskMetrics(): TaskMetrics\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskMetrics
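
As a quick illustration, a task can look at its own TaskMetrics while it runs. A minimal sketch (taskMetrics() is a developer API, so the available counters may differ between Spark versions):

import org.apache.spark.TaskContext

// Runs on the executors; every task reports a few of its own metrics collected so far.
sc.range(0, 2, numSlices = 2).foreach { _ =>
  val metrics = TaskContext.get.taskMetrics()
  println(s"memoryBytesSpilled=${metrics.memoryBytesSpilled} " +
    s"diskBytesSpilled=${metrics.diskBytesSpilled} " +
    s"peakExecutionMemory=${metrics.peakExecutionMemory}")
}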

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#others","title":"others

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Important

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      There are other methods, but don't seem very interesting.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContext/#implementations","title":"Implementations","text":"
• BarrierTaskContext
• TaskContextImpl
"},{"location":"scheduler/TaskContext/#serializable","title":"Serializable

TaskContext is a Serializable (Java).

","text":""},{"location":"scheduler/TaskContext/#accessing-taskcontext","title":"Accessing TaskContext
get(): TaskContext

get returns the thread-local TaskContext instance.

import org.apache.spark.TaskContext
val tc = TaskContext.get

val rdd = sc.range(0, 3, numSlices = 3)

assert(rdd.partitions.size == 3)

rdd.foreach { n =>
  import org.apache.spark.TaskContext
  val tc = TaskContext.get
  val msg = s"""|-------------------
                |partitionId:   ${tc.partitionId}
                |stageId:       ${tc.stageId}
                |attemptNum:    ${tc.attemptNumber}
                |taskAttemptId: ${tc.taskAttemptId}
                |-------------------""".stripMargin
  println(msg)
}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"scheduler/TaskContextImpl/","title":"TaskContextImpl","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskContextImpl is a concrete TaskContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskContextImpl/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskContextImpl takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Stage Execution Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Partition ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Task Execution Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Attempt Number
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Local Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MetricsSystem
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskContextImpl is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Task is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/TaskContextImpl/#resources","title":"Resources","text":"TaskContext
resources: Map[String, ResourceInformation]

resources is part of the TaskContext abstraction.

TaskContextImpl can be given resources (names) when created.

The resources are given when a Task is requested to run; they come from the TaskDescription (of a TaskRunner).

"},{"location":"scheduler/TaskContextImpl/#barriertaskcontext","title":"BarrierTaskContext

TaskContextImpl is available to barrier tasks as a BarrierTaskContext.
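
For illustration, a minimal sketch of a barrier stage in which every task gets a BarrierTaskContext and synchronizes with the others. It assumes the cluster has enough free slots to run all partitions of the barrier stage at once (a requirement of barrier execution).

import org.apache.spark.BarrierTaskContext

// Barrier stage: all 4 tasks are launched together and meet at barrier().
val out = sc.range(0, 4, numSlices = 4)
  .barrier()
  .mapPartitions { iter =>
    val tc = BarrierTaskContext.get()
    tc.barrier()  // global synchronization point across all tasks of this stage
    Iterator(s"partition ${tc.partitionId()} passed the barrier")
  }
  .collect()
out.foreach(println)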

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"scheduler/TaskDescription/","title":"TaskDescription","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskDescription is a metadata of a Task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"scheduler/TaskDescription/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TaskDescription takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Task ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Task attempt number
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Task name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Task index (within the TaskSet)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Partition ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Added files (as Map[String, Long])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Added JAR files (as Map[String, Long])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Serialized task (as ByteBuffer)

TaskDescription is created when:

• TaskSetManager is requested to find a task ready for execution (given a resource offer)
"},{"location":"scheduler/TaskDescription/#resources","title":"Resources","text":"
resources: Map[String, ResourceInformation]

TaskDescription is given resources when created.

The resources are either specified when TaskSetManager is requested to resourceOffer (and prepareLaunchingTask) or decoded from bytes.

"},{"location":"scheduler/TaskDescription/#text-representation","title":"Text Representation
toString: String

toString uses the taskId and index as follows:

TaskDescription(TID=[taskId], index=[index])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskDescription/#decoding-taskdescription-from-serialized-format","title":"Decoding TaskDescription (from Serialized Format)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          decode(\n  byteBuffer: ByteBuffer): TaskDescription\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          decode simply decodes (<>) a TaskDescription from the serialized format (ByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Internally, decode...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          decode is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • CoarseGrainedExecutorBackend is requested to CoarseGrainedExecutorBackend.md#LaunchTask[handle a LaunchTask message]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Spark on Mesos' MesosExecutorBackend is requested to spark-on-mesos:spark-executor-backends-MesosExecutorBackend.md#launchTask[launch a task]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskDescription/#encoding-taskdescription-to-serialized-format","title":"Encoding TaskDescription (to Serialized Format)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          encode(\n  taskDescription: TaskDescription): ByteBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          encode simply encodes the TaskDescription to a serialized format (ByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Internally, encode...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          encode is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DriverEndpoint (of CoarseGrainedSchedulerBackend) is requested to launchTasks
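
Taken together, encode and decode define the wire format used to ship a task from the driver to an executor. A conceptual sketch of the round trip follows; both methods are internal to Spark, so the (hypothetical) helper below would only compile inside Spark's own org.apache.spark.scheduler package.

package org.apache.spark.scheduler

import java.nio.ByteBuffer

// Hypothetical helper, for illustration only.
object TaskDescriptionRoundTrip {
  def roundTrip(td: TaskDescription): TaskDescription = {
    // Driver side: serialize the task metadata before sending a LaunchTask message.
    val bytes: ByteBuffer = TaskDescription.encode(td)
    // Executor side: rebuild the TaskDescription upon receiving the message.
    TaskDescription.decode(bytes)
  }
}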
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskDescription/#task-name","title":"Task Name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The name of the task is of the format:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          task [taskID] in stage [taskSetID]\n
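
For illustration only, a name following this format could be produced with plain string interpolation; the values below are made up and are not Spark's actual fields:

```scala
// Hypothetical values, used only to illustrate the naming format
val taskID = 3
val taskSetID = "1.0"   // assumed to look like stageId.stageAttemptId

val taskName = s"task $taskID in stage $taskSetID"
// taskName: String = task 3 in stage 1.0
```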
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"scheduler/TaskInfo/","title":"TaskInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[TaskInfo]] TaskInfo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TaskInfo is information about a running task attempt inside a scheduler:TaskSet.md[TaskSet].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TaskInfo is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • scheduler:TaskSetManager.md#resourceOffer[TaskSetManager dequeues a task for execution (given resource offer)] (and records the task as running)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskUIData does dropInternalAndSQLAccumulables

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JsonProtocol utility is used to spark-history-server:JsonProtocol.md#taskInfoFromJson[re-create a task details from JSON]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: Back then, at the commit 63051dd2bcc4bf09d413ff7cf89a37967edc33ba, when TaskInfo was first merged to Apache Spark on 07/06/12, TaskInfo was part of spark.scheduler.mesos package -- note \"Mesos\" in the name of the package that shows how much Spark and Mesos influenced each other at that time.

[[internal-registries]]
.TaskInfo's Internal Registries and Counters
[cols="1,2",options="header",width="100%"]
|===
| Name | Description

| [[finishTime]] finishTime
| Time when TaskInfo was <<markFinished, marked as finished>>.

Used when...FIXME
|===

=== [[creating-instance]] Creating TaskInfo Instance

TaskInfo takes the following when created:

* [[taskId]] Task ID
* [[index]] Index of the task within its scheduler:TaskSet.md[TaskSet] (not necessarily the same as the ID of the RDD partition the task is computing)
* [[attemptNumber]] Task attempt ID
* [[launchTime]] Time when the task was dequeued for execution
* [[executorId]] Executor that has been offered (as a resource) to run the task
* [[host]] Host of the <<executorId, executor>>
* [[taskLocality]] scheduler:TaskSchedulerImpl.md#TaskLocality[TaskLocality], i.e. the locality preference of the task
* [[speculative]] Flag that says whether the task is speculative or not

TaskInfo initializes the <<internal-registries, internal registries and counters>>.
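
The following is a minimal, hypothetical sketch of those parameters as a plain Scala case class. It is not Spark's actual TaskInfo class; the field types, in particular the locality being a plain String, are assumptions made for illustration only.

[source, scala]
----
// Hypothetical mirror of the parameters listed above, for illustration only
// (not Spark's actual TaskInfo; TaskLocality is simplified to a String)
case class TaskInfoLike(
    taskId: Long,
    index: Int,
    attemptNumber: Int,
    launchTime: Long,
    executorId: String,
    host: String,
    taskLocality: String,
    speculative: Boolean)

val info = TaskInfoLike(
  taskId = 14L,
  index = 3,
  attemptNumber = 0,
  launchTime = System.currentTimeMillis(),
  executorId = "0",
  host = "10.0.0.1",
  taskLocality = "PROCESS_LOCAL",
  speculative = false)
----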

=== [[markFinished]] Marking Task As Finished (Successfully or Not) -- markFinished Method

[source, scala]
----
markFinished(state: TaskState, time: Long = System.currentTimeMillis): Unit
----

markFinished records the input time as <<finishTime, finishTime>>.

markFinished marks TaskInfo as failed when the input state is FAILED, or as killed when the state is KILLED.

NOTE: markFinished is used when TaskSetManager is notified that a task has finished scheduler:TaskSetManager.md#handleSuccessfulTask[successfully] or scheduler:TaskSetManager.md#handleFailedTask[failed].
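
A minimal sketch of the behaviour just described, assuming mutable finishTime, failed and killed fields and a stand-in TaskState enumeration (this is not Spark's actual implementation):

[source, scala]
----
// Stand-in for org.apache.spark.TaskState, for this sketch only
object TaskState extends Enumeration {
  val RUNNING, FINISHED, FAILED, KILLED = Value
}

class TaskInfoSketch {
  var finishTime: Long = 0L
  var failed: Boolean = false
  var killed: Boolean = false

  def markFinished(state: TaskState.Value, time: Long = System.currentTimeMillis()): Unit = {
    finishTime = time                                  // record when the attempt finished
    if (state == TaskState.FAILED) failed = true       // FAILED sets the failed flag
    else if (state == TaskState.KILLED) killed = true  // KILLED sets the killed flag
  }
}
----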

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"scheduler/TaskLocation/","title":"TaskLocation","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskLocation represents a placement preference of an RDD partition, i.e. a hint of the location to submit scheduler:Task.md[tasks] for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskLocations are tracked by scheduler:DAGScheduler.md#cacheLocs[DAGScheduler] for scheduler:DAGScheduler.md#submitMissingTasks[submitting missing tasks of a stage].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskLocation is available as scheduler:Task.md#preferredLocations[preferredLocations] of a task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            [[host]] Every TaskLocation describes the location by host name, but could also use other location-related metadata.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TaskLocations of an RDD and a partition is available using SparkContext.md#getPreferredLocs[SparkContext.getPreferredLocs] method.
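
A minimal, runnable sketch of that call in local mode. The RDD below is a parallelized collection, so there is typically no locality and the result is empty; RDDs backed by HDFS blocks would return real hosts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask the driver where partition 0 of an RDD prefers to run
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("preferred-locs"))
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Seq of TaskLocations for partition 0 (empty for a parallelized collection)
println(sc.getPreferredLocs(rdd, 0))

sc.stop()
```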

## Sealed

TaskLocation is a Scala private[spark] sealed trait, so all the available implementations of the TaskLocation trait are in a single Scala file.
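
A simplified sketch of such a sealed hierarchy, with leaf types loosely mirroring the implementations described below (the names and fields here are illustrative, not Spark's exact definitions):

```scala
// Sealed: all implementations must live in this one file
sealed trait Location {
  def host: String
}

// prefer a specific executor on a host (cf. ExecutorCacheTaskLocation)
case class ExecutorCacheLocation(host: String, executorId: String) extends Location

// prefer a host holding an HDFS-cached block (cf. HDFSCacheTaskLocation)
case class HdfsCacheLocation(host: String) extends Location

// prefer a host only (cf. HostTaskLocation)
case class HostLocation(host: String) extends Location
```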

== [[ExecutorCacheTaskLocation]] ExecutorCacheTaskLocation

ExecutorCacheTaskLocation describes a <<host, host>> and an executor.

ExecutorCacheTaskLocation informs the scheduler to prefer the given executor, falling back to any executor on the same host if that is not possible.

== [[HDFSCacheTaskLocation]] HDFSCacheTaskLocation

HDFSCacheTaskLocation describes a <<host, host>> that is cached by HDFS.

Used exclusively when rdd:HadoopRDD.md#getPreferredLocations[HadoopRDD] and rdd:NewHadoopRDD.md#getPreferredLocations[NewHadoopRDD] are requested for their placement preferences (aka preferred locations).

== [[HostTaskLocation]] HostTaskLocation

HostTaskLocation describes a <<host, host>> only.

# TaskResult

TaskResult is an abstraction of task results (of type T).

The decision on which TaskResult type to use is made when TaskRunner finishes running a task.

## Sealed Trait

TaskResult is a Scala sealed trait, which means that all of the implementations are in the same compilation unit (a single file).

## DirectTaskResult

DirectTaskResult is a TaskResult to be serialized and sent over the wire to the driver, together with the following:

* Value Bytes (java.nio.ByteBuffer)
* Accumulator updates
* Metric Peaks

DirectTaskResult is used when the size of a task result is below spark.driver.maxResultSize and the maximum size of direct results.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"scheduler/TaskResult/#indirecttaskresult","title":"IndirectTaskResult

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              IndirectTaskResult is a \"pointer\" to a task result that is available in a BlockManager:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Size

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                IndirectTaskResult is a java.io.Serializable.
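
Putting the two result types together, here is a rough sketch of the size-based choice made on the executor side. The types, names and the handling of oversized results are simplified assumptions, not TaskRunner's actual code.

```scala
import java.nio.ByteBuffer

// Simplified stand-ins for the two result kinds described above
sealed trait ResultSketch
case class DirectResult(valueBytes: ByteBuffer) extends ResultSketch
case class IndirectResult(blockId: String, size: Long) extends ResultSketch

// Small results travel directly with the status update; larger ones are put
// into the BlockManager and only a "pointer" (block ID and size) is sent back.
def chooseResult(
    serializedResult: ByteBuffer,
    maxDirectResultSize: Long,                  // e.g. spark.task.maxDirectResultSize
    maxResultSize: Long,                        // e.g. spark.driver.maxResultSize
    storeInBlockManager: ByteBuffer => String): ResultSketch = {
  val size = serializedResult.limit().toLong
  if (size <= maxDirectResultSize && size < maxResultSize) {
    DirectResult(serializedResult)
  } else {
    IndirectResult(storeInBlockManager(serializedResult), size)
  }
}
```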

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/TaskResult/#externalizable","title":"Externalizable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                DirectTaskResult is an Externalizable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/TaskResultGetter/","title":"TaskResultGetter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskResultGetter is a helper class of scheduler:TaskSchedulerImpl.md#statusUpdate[TaskSchedulerImpl] for asynchronous deserialization of <> (possibly fetching remote blocks) or <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME Image with the dependencies

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TIP: Consult scheduler:Task.md#states[Task States] in Tasks to learn about the different task states.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: The only instance of TaskResultGetter is created while scheduler:TaskSchedulerImpl.md#creating-instance[TaskSchedulerImpl is created].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskResultGetter requires a core:SparkEnv.md[SparkEnv] and scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] to be created and is stopped when scheduler:TaskSchedulerImpl.md#stop[TaskSchedulerImpl stops].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskResultGetter uses <task-result-getter asynchronous task executor>> for operation."},{"location":"scheduler/TaskResultGetter/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable DEBUG logging level for org.apache.spark.scheduler.TaskResultGetter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.scheduler.TaskResultGetter=DEBUG\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[getTaskResultExecutor]][[task-result-getter]] task-result-getter Asynchronous Task Executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#source-scala","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#gettaskresultexecutor-executorservice","title":"getTaskResultExecutor: ExecutorService","text":"

getTaskResultExecutor creates a daemon thread pool with <<spark_resultGetter_threads, spark.resultGetter.threads>> threads and the task-result-getter thread name prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TIP: Read up on https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html[java.util.concurrent.ThreadPoolExecutor] that getTaskResultExecutor uses under the covers.
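For illustration only, a daemon thread pool with such a thread-name prefix could be built with plain java.util.concurrent as follows. This is a sketch, not Spark's own code (Spark builds the pool with its internal thread utilities); the pool size of 4 mirrors the spark.resultGetter.threads default:

[source, scala]
----
import java.util.concurrent.{Executors, ExecutorService, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object TaskResultGetterPoolSketch {

  // Fixed-size pool whose threads are daemons named "<prefix>-0", "<prefix>-1", ...
  def newDaemonFixedThreadPool(nThreads: Int, prefix: String): ExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
        t.setDaemon(true) // daemon threads do not block JVM shutdown
        t
      }
    }
    Executors.newFixedThreadPool(nThreads, factory)
  }

  def main(args: Array[String]): Unit = {
    val pool = newDaemonFixedThreadPool(4, "task-result-getter")
    pool.execute(() => println(s"running on ${Thread.currentThread().getName}"))
    pool.shutdown()
  }
}
----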

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[stop]] stop Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#source-scala_1","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#stop-unit","title":"stop(): Unit","text":"

stop stops the internal <<task-result-getter, task-result-getter asynchronous task executor>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[serializer]] serializer Attribute

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#source-scala_2","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#serializer-threadlocalserializerinstance","title":"serializer: ThreadLocal[SerializerInstance]","text":"

serializer is a thread-local serializer:SerializerInstance.md[SerializerInstance] that TaskResultGetter uses to deserialize byte buffers (with a TaskResult or a TaskEndReason).

When created for a new thread, serializer is initialized with a new SerializerInstance (using core:SparkEnv.md#closureSerializer[SparkEnv.closureSerializer]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: TaskResultGetter uses https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html[java.lang.ThreadLocal] for the thread-local SerializerInstance variable.
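The pattern looks roughly like the following simplified sketch; it only assumes that a SparkEnv is already available (SparkEnv is a DeveloperApi):

[source, scala]
----
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

// Simplified sketch of a per-thread SerializerInstance: every pool thread
// lazily gets its own instance of the closure serializer.
val serializer: ThreadLocal[SerializerInstance] =
  new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance =
      SparkEnv.get.closureSerializer.newInstance()
  }
----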

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[taskResultSerializer]] taskResultSerializer Attribute

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#source-scala_3","title":"[source, scala]","text":""},{"location":"scheduler/TaskResultGetter/#taskresultserializer-threadlocalserializerinstance","title":"taskResultSerializer: ThreadLocal[SerializerInstance]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                taskResultSerializer is a thread-local serializer:SerializerInstance.md[SerializerInstance] that TaskResultGetter uses to...

When created for a new thread, taskResultSerializer is initialized with a new SerializerInstance (using core:SparkEnv.md#serializer[SparkEnv.serializer]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: TaskResultGetter uses https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html[java.lang.ThreadLocal] for the thread-local SerializerInstance variable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskResultGetter/#enqueuing-successful-task","title":"Enqueuing Successful Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                enqueueSuccessfulTask(\n  taskSetManager: TaskSetManager,\n  tid: Long,\n  serializedData: ByteBuffer): Unit\n

enqueueSuccessfulTask submits an asynchronous task (to the <<task-result-getter, task-result-getter>> asynchronous task executor) that first deserializes serializedData to a DirectTaskResult, then updates the internal accumulator (with the size of the DirectTaskResult) and ultimately notifies the TaskSchedulerImpl that the tid task was completed and scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[the task result was received successfully] or scheduler:TaskSchedulerImpl.md#handleFailedTask[not].

NOTE: enqueueSuccessfulTask merely enqueues the asynchronous task; the <<task-result-getter, task-result-getter>> asynchronous task executor executes it at some point in the future.

Internally, the enqueued task first deserializes serializedData to a TaskResult (using the internal thread-local <<serializer, serializer>>).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                For a DirectTaskResult, the task scheduler:TaskSetManager.md#canFetchMoreResults[checks the available memory for the task result] and, when the size overflows configuration-properties.md#spark.driver.maxResultSize[spark.driver.maxResultSize], it simply returns.

Note

enqueueSuccessfulTask runs as an asynchronous task, so simply returning does nothing more on its own; it is the scheduler:TaskSetManager.md#canFetchMoreResults[quota check] itself that aborts the TaskSet when there is not enough memory for the task result.

Otherwise, when there is enough memory to hold the task result, it deserializes the DirectTaskResult (using the internal thread-local <<taskResultSerializer, taskResultSerializer>>).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                For an IndirectTaskResult, the task checks the available memory for the task result and, when the size could overflow the maximum result size, it storage:BlockManagerMaster.md#removeBlock[removes the block] and simply returns.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, when there is enough memory to hold the task result, you should see the following DEBUG message in the logs:

Fetching indirect task result for TID [tid]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The task scheduler:TaskSchedulerImpl.md#handleTaskGettingResult[notifies TaskSchedulerImpl that it is about to fetch a remote block for a task result]. It then storage:BlockManager.md#getRemoteBytes[gets the block from remote block managers (as serialized bytes)].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When the block could not be fetched, scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed] (with TaskResultLost task failure reason) and the task simply returns.

NOTE: enqueueSuccessfulTask runs as an asynchronous task, so simply returning does nothing more; the real handling happens when scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed].

The task result (as a serialized byte buffer) is then deserialized to a DirectTaskResult (using the internal thread-local <<serializer, serializer>>) and deserialized again using the internal thread-local <<taskResultSerializer, taskResultSerializer>> (just like for the DirectTaskResult case), and the storage:BlockManagerMaster.md#removeBlock[block is removed from BlockManagerMaster].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

An IndirectTaskResult is deserialized twice to become the final deserialized task result (using <<serializer, serializer>> for a DirectTaskResult). Compare that to a DirectTaskResult task result that is deserialized only once.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With no exceptions thrown, enqueueSuccessfulTask scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[informs the TaskSchedulerImpl that the tid task was completed and the task result was received].

A ClassNotFoundException leads to scheduler:TaskSetManager.md#abort[aborting the TaskSet] (with a ClassNotFound with classloader: [loader] error message), while any non-fatal exception shows the following ERROR message in the logs followed by scheduler:TaskSetManager.md#abort[aborting the TaskSet].

Exception while getting task result

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                enqueueSuccessfulTask is used when TaskSchedulerImpl is requested to handle task status update (and the task has finished successfully).
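The overall shape of the control flow (enqueue, deserialize, abort on failure) can be sketched as follows. The names deserialize, onResult and onAbort are illustrative placeholders, not Spark API: the point is that the work runs on the task-result-getter pool and that deserialization failures end up aborting the TaskSet rather than killing the pool thread.

[source, scala]
----
import java.nio.ByteBuffer
import java.util.concurrent.ExecutorService
import scala.util.control.NonFatal

// Sketch of the enqueue-then-deserialize pattern described above (placeholder
// callbacks, not Spark's source).
def enqueueSuccessfulTaskSketch(
    pool: ExecutorService,
    serializedData: ByteBuffer,
    deserialize: ByteBuffer => Any,
    onResult: Any => Unit,
    onAbort: String => Unit): Unit = {
  pool.execute(new Runnable {
    override def run(): Unit =
      try {
        // A DirectTaskResult is handled in place; an IndirectTaskResult requires
        // fetching the result block from a remote BlockManager first.
        onResult(deserialize(serializedData))
      } catch {
        case _: ClassNotFoundException =>
          onAbort("ClassNotFound with classloader: ...") // aborts the TaskSet
        case NonFatal(e) =>
          onAbort(s"Exception while getting task result: $e") // aborts the TaskSet
      }
  })
}
----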

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[enqueueFailedTask]] Deserializing TaskFailedReason and Notifying TaskSchedulerImpl -- enqueueFailedTask Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/TaskResultGetter/#source-scala_4","title":"[source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                enqueueFailedTask( taskSetManager: TaskSetManager, tid: Long, taskState: TaskState.TaskState, serializedData: ByteBuffer): Unit

enqueueFailedTask submits an asynchronous task (to the <<task-result-getter, task-result-getter>> asynchronous task executor) that first attempts to deserialize a TaskFailedReason from serializedData (using the internal thread-local <<serializer, serializer>>) and then scheduler:TaskSchedulerImpl.md#handleFailedTask[notifies TaskSchedulerImpl that the task has failed].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Any ClassNotFoundException leads to the following ERROR message in the logs (without breaking the flow of enqueueFailedTask):

ERROR Could not deserialize TaskEndReason: ClassNotFound with classloader [loader]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: enqueueFailedTask is called when scheduler:TaskSchedulerImpl.md#statusUpdate[TaskSchedulerImpl is notified about a task that has failed (and is in FAILED, KILLED or LOST state)].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[settings]] Settings

.Spark Properties
[cols="1,1,2",options="header",width="100%"]
|===
| Spark Property | Default Value | Description

| [[spark_resultGetter_threads]] spark.resultGetter.threads
| 4
| The number of threads for TaskResultGetter.
|===
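
For example, the thread count can be overridden through SparkConf. Note that spark.resultGetter.threads is an internal property, so treat this as a tuning knob rather than a public API:

[source, scala]
----
import org.apache.spark.SparkConf

// Bump the number of task-result-getter threads from the default of 4.
val conf = new SparkConf()
  .setAppName("task-result-getter-demo")
  .set("spark.resultGetter.threads", "8")
----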

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"scheduler/TaskScheduler/","title":"TaskScheduler","text":"

TaskScheduler is an abstraction of task schedulers that can <<submitTasks, submit tasks>> for execution in a Spark application (per scheduling policy).

NOTE: TaskScheduler works closely with scheduler:DAGScheduler.md[DAGScheduler] that <<submitTasks, submits sets of tasks>> (for every stage in a Spark job).

TaskScheduler can track the executors available in a Spark application using <<executorHeartbeatReceived, executorHeartbeatReceived>> and executorLost interceptors (that inform about active and lost executors, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[submitTasks]] Submitting Tasks for Execution

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskScheduler/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                submitTasks( taskSet: TaskSet): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Submits the tasks (of the given scheduler:TaskSet.md[TaskSet]) for execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when DAGScheduler is requested to scheduler:DAGScheduler.md#submitMissingTasks[submit missing tasks (of a stage)].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[executorHeartbeatReceived]] Handling Executor Heartbeat

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskScheduler/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                executorHeartbeatReceived( execId: String, accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])], blockManagerId: BlockManagerId): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Handles a heartbeat from an executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Returns true when the execId executor is managed by the TaskScheduler. false indicates that the executor:Executor.md#reportHeartBeat[block manager (on the executor) should re-register].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when HeartbeatReceiver RPC endpoint is requested to handle a Heartbeat (with task metrics) from an executor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[killTaskAttempt]] Killing Task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskScheduler/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                killTaskAttempt( taskId: Long, interruptThread: Boolean, reason: String): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Kills a task (attempt)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when DAGScheduler is requested to scheduler:DAGScheduler.md#killTaskAttempt[kill a task]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                == [[workerRemoved]] workerRemoved Notification

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskScheduler/#source-scala_3","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                workerRemoved( workerId: String, host: String, message: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when DriverEndpoint is requested to handle a RemoveWorker event
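Taken together, the operations above can be summarized with the following condensed sketch. The parameter types are placeholders so the sketch compiles on its own; the real contract (with TaskSet, AccumulatorV2 and BlockManagerId) is org.apache.spark.scheduler.TaskScheduler:

[source, scala]
----
// Condensed, self-contained sketch of the TaskScheduler operations described
// above; the *Like type aliases stand in for Spark's scheduler classes.
object TaskSchedulerSketch {
  type TaskSetLike = AnyRef
  type AccumUpdatesLike = Array[(Long, Seq[AnyRef])]
  type BlockManagerIdLike = AnyRef

  trait SimplifiedTaskScheduler {
    def submitTasks(taskSet: TaskSetLike): Unit
    def executorHeartbeatReceived(
        execId: String,
        accumUpdates: AccumUpdatesLike,
        blockManagerId: BlockManagerIdLike): Boolean
    def killTaskAttempt(taskId: Long, interruptThread: Boolean, reason: String): Boolean
    def workerRemoved(workerId: String, host: String, message: String): Unit
  }
}
----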

== [[contract]] Contract

The required methods are summarized below (a simplified Scala sketch follows the table).

[cols="30m,70",options="header",width="100%"]
|===
| Method
| Description

| applicationAttemptId
a| [[applicationAttemptId]]

[source, scala]
----
applicationAttemptId(): Option[String]
----

Unique identifier of an (execution) attempt of the Spark application

Used when SparkContext is created

| cancelTasks
a| [[cancelTasks]]

[source, scala]
----
cancelTasks(
  stageId: Int,
  interruptThread: Boolean): Unit
----

Cancels all the tasks of a given Stage.md[stage]

Used when DAGScheduler is requested to DAGScheduler.md#failJobAndIndependentStages[failJobAndIndependentStages]

| defaultParallelism
a| [[defaultParallelism]]

[source, scala]
----
defaultParallelism(): Int
----

Default level of parallelism

Used when SparkContext is requested for the default level of parallelism

| executorLost
a| [[executorLost]]

[source, scala]
----
executorLost(
  executorId: String,
  reason: ExecutorLossReason): Unit
----

Handles an executor lost event

Used when:

* HeartbeatReceiver RPC endpoint is requested to expireDeadHosts
* DriverEndpoint RPC endpoint is requested to remove (forget) and disable a malfunctioning executor (i.e. either lost or blacklisted for some reason)

| killAllTaskAttempts
a| [[killAllTaskAttempts]]

[source, scala]
----
killAllTaskAttempts(
  stageId: Int,
  interruptThread: Boolean,
  reason: String): Unit
----

Kills all the task attempts of a given stage

Used when:

* DAGScheduler is requested to DAGScheduler.md#handleTaskCompletion[handleTaskCompletion]
* TaskSchedulerImpl is requested to TaskSchedulerImpl.md#cancelTasks[cancel all the tasks of a stage]

| rootPool
a| [[rootPool]]

[source, scala]
----
rootPool: Pool
----

Top-level (root) scheduler:spark-scheduler-Pool.md[schedulable pool]

Used when:

* TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize]
* SparkContext is requested to SparkContext.md#getAllPools[getAllPools] and SparkContext.md#getPoolForName[getPoolForName]
* TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#resourceOffers[resourceOffers], scheduler:TaskSchedulerImpl.md#checkSpeculatableTasks[checkSpeculatableTasks], and scheduler:TaskSchedulerImpl.md#removeExecutor[removeExecutor]

| schedulingMode
a| [[schedulingMode]]

[source, scala]
----
schedulingMode: SchedulingMode
----

scheduler:spark-scheduler-SchedulingMode.md[Scheduling mode]

Used when:

* TaskSchedulerImpl is scheduler:TaskSchedulerImpl.md#rootPool[created] and scheduler:TaskSchedulerImpl.md#initialize[initialized]
* SparkContext is requested to SparkContext.md#getSchedulingMode[getSchedulingMode]

| setDAGScheduler
a| [[setDAGScheduler]]

[source, scala]
----
setDAGScheduler(
  dagScheduler: DAGScheduler): Unit
----

Associates a scheduler:DAGScheduler.md[DAGScheduler]

Used when DAGScheduler is scheduler:DAGScheduler.md#creating-instance[created]

| start
a| [[start]]

[source, scala]
----
start(): Unit
----

Starts the TaskScheduler

Used when SparkContext is created

| stop
a| [[stop]]

[source, scala]
----
stop(): Unit
----

Stops the TaskScheduler

Used when DAGScheduler is requested to scheduler:DAGScheduler.md#stop[stop]

|===
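Taken together, the contract above can be summarized as a simplified Scala trait. This is a sketch for orientation only: the type aliases are placeholders for Spark's internal scheduler classes, and members of the real org.apache.spark.scheduler.TaskScheduler trait are omitted.

[source, scala]
----
// Simplified, self-contained sketch of the TaskScheduler contract above.
// The type aliases are placeholders for Spark's internal scheduler classes
// (Pool, SchedulingMode, ExecutorLossReason, DAGScheduler); this is not the
// actual org.apache.spark.scheduler.TaskScheduler source.
object TaskSchedulerContractSketch {
  type Pool               = AnyRef
  type SchedulingMode     = String
  type ExecutorLossReason = String
  type DAGScheduler       = AnyRef

  trait TaskScheduler {
    def applicationAttemptId(): Option[String]
    def cancelTasks(stageId: Int, interruptThread: Boolean): Unit
    def defaultParallelism(): Int
    def executorLost(executorId: String, reason: ExecutorLossReason): Unit
    def killAllTaskAttempts(stageId: Int, interruptThread: Boolean, reason: String): Unit
    def rootPool: Pool
    def schedulingMode: SchedulingMode
    def setDAGScheduler(dagScheduler: DAGScheduler): Unit
    def start(): Unit
    def stop(): Unit
  }
}
----

Implementations (e.g. TaskSchedulerImpl) provide the actual behavior behind these methods.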

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskScheduler/#lifecycle","title":"Lifecycle","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A TaskScheduler is created while SparkContext is being created (by calling SparkContext.createTaskScheduler for a given master URL and deploy mode).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                At this point in SparkContext's lifecycle, the internal _taskScheduler points at the TaskScheduler (and it is \"announced\" by sending a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC endpoint).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The <> right after the blocking TaskSchedulerIsSet message receives a response.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The <> and the <> are set at this point (and SparkContext uses the application id to set SparkConf.md#spark.app.id[spark.app.id] Spark property, and configure webui:spark-webui-SparkUI.md[SparkUI], and storage:BlockManager.md[BlockManager]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CAUTION: FIXME The application id is described as \"associated with the job.\" in TaskScheduler, but I think it is \"associated with the application\" and you can have many jobs per application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Right before SparkContext is fully initialized, <> is called.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The internal _taskScheduler is cleared (i.e. set to null) while SparkContext.md#stop[SparkContext is being stopped].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <> while scheduler:DAGScheduler.md#stop[DAGScheduler is being stopped].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                WARNING: FIXME If it is SparkContext to start a TaskScheduler, shouldn't SparkContext stop it too? Why is this the way it is now?
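To make the ordering above concrete, here is a self-contained pseudocode sketch of the start-up and shutdown sequence. Every definition below is an illustrative stub; only the sequence of steps mirrors the description, not Spark's actual SparkContext code.

[source, scala]
----
// Pseudocode sketch of the lifecycle: every definition is an illustrative stub,
// and only the ordering of the steps mirrors the description above.
object TaskSchedulerLifecycleSketch extends App {
  trait TaskScheduler {
    def start(): Unit
    def stop(): Unit
    def applicationId(): String
    def postStartHook(): Unit
  }

  def createTaskScheduler(master: String, deployMode: String): TaskScheduler =
    new TaskScheduler {
      def start(): Unit = println(s"TaskScheduler started ($master, $deployMode)")
      def stop(): Unit = println("TaskScheduler stopped")
      def applicationId(): String = s"spark-application-${System.currentTimeMillis}"
      def postStartHook(): Unit = println("postStartHook called")
    }

  // 1. SparkContext creates the TaskScheduler for a master URL and deploy mode.
  var _taskScheduler: TaskScheduler = createTaskScheduler("local[*]", "client")

  // 2. The scheduler is "announced" (blocking TaskSchedulerIsSet to HeartbeatReceiver)
  //    and started right after the response arrives.
  println("HeartbeatReceiver acknowledged TaskSchedulerIsSet")
  _taskScheduler.start()

  // 3. The application id becomes the spark.app.id property (also used to configure
  //    SparkUI and BlockManager).
  println(s"spark.app.id = ${_taskScheduler.applicationId()}")

  // 4. Right before SparkContext is fully initialized, postStartHook is called.
  _taskScheduler.postStartHook()

  // 5. On shutdown, DAGScheduler stops the TaskScheduler and SparkContext clears it.
  _taskScheduler.stop()
  _taskScheduler = null
}
----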

== [[postStartHook]] Post-Start Initialization

[source, scala]
----
postStartHook(): Unit
----

postStartHook does nothing by default, but allows custom TaskSchedulers to do some additional post-start initialization (a sketch follows the list below).

postStartHook is used when:

* SparkContext is created
* Spark on YARN's YarnClusterScheduler is requested to spark-on-yarn:spark-yarn-yarnclusterscheduler.md#postStartHook[postStartHook]
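As a purely hypothetical illustration (not YarnClusterScheduler's actual code), a custom scheduler could use postStartHook to block until some external readiness condition holds. The WaitingScheduler and its readiness check below are assumptions made for the sketch.

[source, scala]
----
// Hypothetical post-start hook: the WaitingScheduler and its readiness check are
// illustrative assumptions, not YarnClusterScheduler's implementation.
object PostStartHookSketch {
  trait Scheduler {
    def start(): Unit = println("scheduler started")
    def postStartHook(): Unit = ()   // no-op by default, as in TaskScheduler
  }

  class WaitingScheduler(ready: () => Boolean) extends Scheduler {
    override def postStartHook(): Unit = {
      // Block until the (hypothetical) cluster-side resources report ready.
      while (!ready()) Thread.sleep(100)
      println("post-start initialization done")
    }
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new WaitingScheduler(ready = () => true)
    scheduler.start()
    scheduler.postStartHook()   // called right before SparkContext is fully initialized
  }
}
----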

== [[applicationId]][[appId]] Unique Identifier of Spark Application

[source, scala]
----
applicationId(): String
----

applicationId is the unique identifier of the Spark application and defaults to spark-application-[currentTimeMillis].

applicationId is used when SparkContext is created.
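As a one-line sketch of the default format mentioned above (the val name is illustrative):

[source, scala]
----
// The spark-application-[currentTimeMillis] default described above, as a one-liner.
val defaultApplicationId: String = s"spark-application-${System.currentTimeMillis}"
// e.g. spark-application-1708195885000
----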

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskSchedulerImpl/","title":"TaskSchedulerImpl","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskSchedulerImpl is a TaskScheduler that uses a SchedulerBackend to schedule tasks (for execution on a cluster manager).

When a Spark application starts (and so an instance of SparkContext is created), a TaskSchedulerImpl is created together with a SchedulerBackend and a DAGScheduler, and soon started.

TaskSchedulerImpl assigns tasks to executors based on executor resource offers.

TaskSchedulerImpl can track racks per host and port (though that is only used with the Hadoop YARN cluster manager).

Using the spark.scheduler.mode configuration property, you can select the scheduling policy.
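
A minimal sketch of selecting the scheduling policy, assuming a driver program with a made-up application name and a local master URL (FIFO is the default):

import org.apache.spark.{SparkConf, SparkContext}

// Select the FAIR scheduling policy before the SparkContext
// (and so TaskSchedulerImpl) is created.
val conf = new SparkConf()
  .setAppName("scheduling-mode-demo") // hypothetical application name
  .setMaster("local[*]")              // illustrative master URL
  .set("spark.scheduler.mode", "FAIR")
val sc = SparkContext.getOrCreate(conf)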

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskSchedulerImpl submits tasks using SchedulableBuilders.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"scheduler/TaskSchedulerImpl/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                TaskSchedulerImpl takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Maximum Number of Task Failures
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • isLocal flag (default: false)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Clock (default: SystemClock)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  While being created, TaskSchedulerImpl sets schedulingMode to the value of spark.scheduler.mode configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  schedulingMode is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSchedulerImpl throws a SparkException for unrecognized scheduling mode:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Unrecognized spark.scheduler.mode: [schedulingModeConf]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, TaskSchedulerImpl creates a TaskResultGetter.
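
The following is an illustrative, self-contained sketch (the enumeration below is a stand-in, not Spark's internal SchedulingMode) of how an unrecognized spark.scheduler.mode value could surface as the SparkException above:

import org.apache.spark.SparkException

// Stand-in enumeration of the supported scheduling modes.
object SchedulingModeSketch extends Enumeration {
  val FIFO, FAIR = Value
}

def parseSchedulingMode(schedulingModeConf: String): SchedulingModeSketch.Value =
  try SchedulingModeSketch.withName(schedulingModeConf.toUpperCase)
  catch {
    case _: NoSuchElementException =>
      throw new SparkException(s"Unrecognized spark.scheduler.mode: $schedulingModeConf")
  }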

TaskSchedulerImpl is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is requested for a TaskScheduler (for local and spark master URLs)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • KubernetesClusterManager and MesosClusterManager are requested for a TaskScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/TaskSchedulerImpl/#maxTaskFailures","title":"Maximum Number of Task Failures","text":"

TaskSchedulerImpl can be given the maximum number of task failures when created, or it defaults to the spark.task.maxFailures configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The number of task failures is used when submitting tasks (to create a TaskSetManager).
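
A hedged sketch of overriding the default (spark.task.maxFailures defaults to 4); the value 8 is only an example:

import org.apache.spark.SparkConf

// Raise the number of task failures tolerated per task; TaskSchedulerImpl
// passes this on to every TaskSetManager it creates.
val conf = new SparkConf().set("spark.task.maxFailures", "8")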

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/TaskSchedulerImpl/#sparktaskcpus","title":"spark.task.cpus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSchedulerImpl uses spark.task.cpus configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#backend","title":"SchedulerBackend
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  backend: SchedulerBackend\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSchedulerImpl is given a SchedulerBackend when requested to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The lifecycle of the SchedulerBackend is tightly coupled to the lifecycle of the TaskSchedulerImpl:

• It is started when TaskSchedulerImpl is started
• It is stopped when TaskSchedulerImpl is stopped

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSchedulerImpl waits until the SchedulerBackend is ready before requesting it for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Reviving resource offers when requested to submitTasks, statusUpdate, handleFailedTask, checkSpeculatableTasks, and executorLost

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Killing tasks when requested to killTaskAttempt and killAllTaskAttempts

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Default parallelism, applicationId and applicationAttemptId when requested for the defaultParallelism, applicationId and applicationAttemptId, respectively

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#unique-identifier-of-spark-application","title":"Unique Identifier of Spark Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  applicationId(): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  applicationId is part of the TaskScheduler abstraction.

applicationId simply requests the SchedulerBackend for the applicationId.
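
A usage sketch, assuming an active SparkContext named sc: the application ID exposed on SparkContext is ultimately answered by TaskSchedulerImpl, which delegates to its SchedulerBackend.

// Read the application ID of the running Spark application.
val appId: String = sc.applicationId
println(appId) // e.g. local-1708195885000 on a local master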

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#cancelling-all-tasks-of-stage","title":"Cancelling All Tasks of Stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  cancelTasks(\n  stageId: Int,\n  interruptThread: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  cancelTasks is part of the TaskScheduler abstraction.

cancelTasks cancels all tasks submitted for execution in the given stage (stageId).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  cancelTasks is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DAGScheduler is requested to failJobAndIndependentStages
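
A user-facing sketch, assuming an active SparkContext named sc and a made-up stage ID: cancelling a stage from the driver goes through DAGScheduler, which eventually leads to cancelTasks.

// Cancel all tasks of stage 1 (the stage ID is illustrative only).
sc.cancelStage(1)
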
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handlesuccessfultask","title":"handleSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleSuccessfulTask(\n  taskSetManager: TaskSetManager,\n  tid: Long,\n  taskResult: DirectTaskResult[_]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleSuccessfulTask requests the given TaskSetManager to handleSuccessfulTask (with the given tid and taskResult).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleSuccessfulTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handletaskgettingresult","title":"handleTaskGettingResult
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleTaskGettingResult(\n  taskSetManager: TaskSetManager,\n  tid: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleTaskGettingResult requests the given TaskSetManager to handleTaskGettingResult.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  handleTaskGettingResult is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize(\n  backend: SchedulerBackend): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize initializes the TaskSchedulerImpl with the given SchedulerBackend.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize saves the given SchedulerBackend.

initialize then sets the root Pool to an empty-named Pool (with the scheduling mode, and initMinShare and initWeight as 0).

NOTE: rootPool and schedulingMode are part of the TaskScheduler abstraction.

initialize sets the SchedulableBuilder (based on the scheduling mode):

• FIFOSchedulableBuilder for FIFO scheduling mode
• FairSchedulableBuilder for FAIR scheduling mode

initialize requests the SchedulableBuilder to build pools.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CAUTION: FIXME Why are rootPool and schedulableBuilder created only now? What do they need that it is not available when TaskSchedulerImpl is created?

NOTE: initialize is called while SparkContext is created (and creates the SchedulerBackend and TaskScheduler).
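
A self-contained, simplified sketch of the steps above (all names suffixed with "Sketch" are hypothetical stand-ins, not Spark's internal classes): create an empty-named root pool, then pick a schedulable builder for the configured scheduling mode.

sealed trait SchedulableBuilderSketch
case class FifoBuilderSketch(rootPool: String) extends SchedulableBuilderSketch
case class FairBuilderSketch(rootPool: String) extends SchedulableBuilderSketch

def initializeSketch(schedulingMode: String): SchedulableBuilderSketch = {
  val rootPool = "" // the root Pool is created with an empty name
  schedulingMode match {
    case "FIFO" => FifoBuilderSketch(rootPool)
    case "FAIR" => FairBuilderSketch(rootPool)
    case other  => sys.error(s"Unrecognized scheduling mode: $other")
  }
}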

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#starting-taskschedulerimpl","title":"Starting TaskSchedulerImpl
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  start starts the SchedulerBackend and the task-scheduler-speculation executor service.
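
A rough sketch of a "task-scheduler-speculation"-style periodic check using a single-threaded scheduled executor; the interval and the printed message are illustrative only, not Spark's configuration:

import java.util.concurrent.{Executors, TimeUnit}

// Periodically run a speculation check on a dedicated scheduler thread.
val speculationScheduler = Executors.newSingleThreadScheduledExecutor()
speculationScheduler.scheduleWithFixedDelay(
  new Runnable {
    override def run(): Unit = {
      // in TaskSchedulerImpl this is where speculatable tasks would be checked
      println("checking for speculatable tasks...")
    }
  },
  100L, // initial delay (ms) -- illustrative only
  100L, // delay between runs (ms) -- illustrative only
  TimeUnit.MILLISECONDS)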

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#handling-task-status-update","title":"Handling Task Status Update
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusUpdate(\n  tid: Long,\n  state: TaskState,\n  serializedData: ByteBuffer): Unit\n

statusUpdate finds the TaskSetManager for the given tid task (in the internal registry of active TaskSetManagers).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When state is LOST, statusUpdate...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode.

When state is one of the finished states (i.e. FINISHED, FAILED, KILLED or LOST), statusUpdate cleans up the internal bookkeeping for the input tid.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusUpdate scheduler:TaskSetManager.md#removeRunningTask[requests TaskSetManager to unregister tid from running tasks].

statusUpdate requests <> to scheduler:TaskResultGetter.md#enqueueSuccessfulTask[schedule an asynchronous task to deserialize the task result (and notify TaskSchedulerImpl back)] for tid in FINISHED state and scheduler:TaskResultGetter.md#enqueueFailedTask[schedule an asynchronous task to deserialize TaskFailedReason (and notify TaskSchedulerImpl back)] for tid in the other finished states (i.e. FAILED, KILLED, LOST).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If a task is in LOST state, statusUpdate scheduler:DAGScheduler.md#executorLost[notifies DAGScheduler that the executor was lost] (with SlaveLost and the reason Task [tid] was lost, so marking the executor as lost as well.) and scheduler:SchedulerBackend.md#reviveOffers[requests SchedulerBackend to revive offers].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In case the TaskSetManager for tid could not be found (in <> registry), you should see the following ERROR message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Ignoring update with state [state] for TID [tid] because its task set is gone (this is likely the result of receiving duplicate task finished status updates)\n

Any exception is caught and reported as an ERROR message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Exception in statusUpdate\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CAUTION: FIXME image with scheduler backends calling TaskSchedulerImpl.statusUpdate.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  statusUpdate is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DriverEndpoint (of CoarseGrainedSchedulerBackend) is requested to handle a StatusUpdate message

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • LocalEndpoint is requested to handle a StatusUpdate message
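To summarize the dispatch described in this section, here is a self-contained Scala sketch. The names (TaskStates, dispatch, onSuccess, onFailure) are stand-ins for illustration only; they mirror, but are not, Spark's package-private internals.

import java.nio.ByteBuffer

// Stand-in illustration of the dispatch above: FINISHED goes to the "successful result" path,
// FAILED/KILLED/LOST to the "failed" path; LAUNCHING/RUNNING need no result handling yet.
object TaskStates extends Enumeration {
  val LAUNCHING, RUNNING, FINISHED, FAILED, KILLED, LOST = Value
}
import TaskStates._

def dispatch(
    tid: Long,
    state: TaskStates.Value,
    serializedData: ByteBuffer,
    onSuccess: (Long, ByteBuffer) => Unit,
    onFailure: (Long, TaskStates.Value, ByteBuffer) => Unit): Unit = state match {
  case FINISHED               => onSuccess(tid, serializedData)
  case FAILED | KILLED | LOST => onFailure(tid, state, serializedData)
  case _                      => // LAUNCHING or RUNNING: nothing to hand off yet
}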

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#task-scheduler-speculation-scheduled-executor-service","title":"task-scheduler-speculation Scheduled Executor Service

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  speculationScheduler is a java.util.concurrent.ScheduledExecutorService with the name task-scheduler-speculation for Speculative Execution of Tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When TaskSchedulerImpl is requested to start (in non-local run mode) with spark.speculation enabled, speculationScheduler is used to schedule checkSpeculatableTasks to execute periodically every spark.speculation.interval.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  speculationScheduler is shut down when TaskSchedulerImpl is requested to stop.
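A minimal sketch of enabling speculation so the service gets started (spark.speculation and spark.speculation.interval are the properties mentioned above; the master URL and application name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Turn speculative execution on so that TaskSchedulerImpl (in a non-local run mode)
// starts the task-scheduler-speculation service on start.
val conf = new SparkConf()
  .setMaster("spark://master:7077")            // placeholder; speculation checks are skipped in local mode
  .setAppName("speculation-demo")
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // how often checkSpeculatableTasks runs
val sc = SparkContext.getOrCreate(conf)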

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#checking-for-speculatable-tasks","title":"Checking for Speculatable Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  checkSpeculatableTasks(): Unit\n

checkSpeculatableTasks requests rootPool to check for speculatable tasks (if they ran for more than 100 ms) and, if there are any, requests scheduler:SchedulerBackend.md#reviveOffers[SchedulerBackend to revive offers].

NOTE: checkSpeculatableTasks is executed periodically as part of speculative-execution-of-tasks.md[Speculative Execution of Tasks].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#cleaning-up-after-removing-executor","title":"Cleaning up After Removing Executor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  removeExecutor(\n  executorId: String,\n  reason: ExecutorLossReason): Unit\n

removeExecutor removes the executorId executor from the following <>: <>, executorIdToHost, executorsByHost, and hostsByRack. If the affected hosts and racks are the last entries in executorsByHost and hostsByRack, respectively, they are removed from the registries.

Unless reason is LossReasonPending, the executor is removed from the executorIdToHost registry and Schedulable.md#executorLost[TaskSetManagers get notified].
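The "remove the last entry" bookkeeping can be pictured with a self-contained sketch. The data shapes and names below are assumptions for illustration, not the actual registries:

import scala.collection.mutable

// Drop an executor from its host's set and remove the host entry once the set becomes empty.
val executorsByHost = mutable.HashMap("host1" -> mutable.HashSet("exec-1", "exec-2"))

def removeExecutor(executorId: String, host: String): Unit =
  executorsByHost.get(host).foreach { execs =>
    execs -= executorId
    if (execs.isEmpty) executorsByHost -= host   // the executor was the last one on the host
  }

removeExecutor("exec-1", "host1")
removeExecutor("exec-2", "host1")
assert(!executorsByHost.contains("host1"))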

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: The internal removeExecutor is called as part of <> and scheduler:TaskScheduler.md#executorLost[executorLost].","text":""},{"location":"scheduler/TaskSchedulerImpl/#handling-nearly-completed-sparkcontext-initialization","title":"Handling Nearly-Completed SparkContext Initialization

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postStartHook(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postStartHook is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  postStartHook waits until a scheduler backend is ready.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#waiting-until-schedulerbackend-is-ready","title":"Waiting Until SchedulerBackend is Ready
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  waitBackendReady(): Unit\n

waitBackendReady waits until the SchedulerBackend is ready. If it already is, waitBackendReady returns immediately. Otherwise, waitBackendReady keeps checking every 100 milliseconds (hardcoded) until the SchedulerBackend is ready or the <> is SparkContext.md#stopped[stopped].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A SchedulerBackend is ready by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If the SparkContext happens to be stopped while waiting, waitBackendReady throws an IllegalStateException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Spark context stopped while waiting for backend\n
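The waiting loop can be pictured with the following simplified, self-contained sketch (isReady and isStopped are stand-ins; the real code checks the SchedulerBackend and the SparkContext):

// Poll every 100 ms until the backend reports ready; give up if the context is stopped meanwhile.
def waitUntilReady(isReady: () => Boolean, isStopped: () => Boolean): Unit = {
  while (!isReady()) {
    if (isStopped()) {
      throw new IllegalStateException("Spark context stopped while waiting for backend")
    }
    Thread.sleep(100)   // the 100 ms interval is hardcoded
  }
}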
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#stopping-taskschedulerimpl","title":"Stopping TaskSchedulerImpl
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop(): Unit\n

stop stops all the internal services, i.e. the task-scheduler-speculation executor service, scheduler:SchedulerBackend.md[SchedulerBackend], scheduler:TaskResultGetter.md[TaskResultGetter], and the <> timer.","text":""},{"location":"scheduler/TaskSchedulerImpl/#default-level-of-parallelism","title":"Default Level of Parallelism

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  defaultParallelism(): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  defaultParallelism is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  defaultParallelism requests the SchedulerBackend for the default level of parallelism.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default level of parallelism is a hint for sizing jobs that SparkContext uses to create RDDs with the right number of partitions unless specified explicitly.
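For example (assuming sc is an active SparkContext):

// When no partition count is given, parallelize falls back on the default level of parallelism.
val hint       = sc.defaultParallelism
val byHint     = sc.parallelize(1 to 100)       // uses defaultParallelism partitions
val byExplicit = sc.parallelize(1 to 100, 4)    // explicit: 4 partitions
assert(byHint.getNumPartitions == hint)
assert(byExplicit.getNumPartitions == 4)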

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#submitting-tasks-of-taskset-for-execution","title":"Submitting Tasks (of TaskSet) for Execution
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  submitTasks(\n  taskSet: TaskSet): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  submitTasks is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In essence, submitTasks registers a new TaskSetManager (for the given TaskSet) and requests the SchedulerBackend to handle resource allocation offers (from the scheduling system).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, submitTasks prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Adding task set [id] with [length] tasks\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  submitTasks then <> (for the given TaskSet.md[TaskSet] and the <>).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  submitTasks registers (adds) the TaskSetManager per TaskSet.md#stageId[stage] and TaskSet.md#stageAttemptId[stage attempt] IDs (of the TaskSet.md[TaskSet]) in the <> internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: <> internal registry tracks the TaskSetManager.md[TaskSetManagers] (that represent TaskSet.md[TaskSets]) per stage and stage attempts. In other words, there could be many TaskSetManagers for a single stage, each representing a unique stage attempt.

NOTE: Not only can a single task be retried (cf. <>), but so can a whole stage.

submitTasks makes sure that there is exactly one active TaskSetManager for the stage (i.e. no other non-zombie TaskSetManager of a different TaskSet across all the managers for the stage). Otherwise, submitTasks throws an IllegalStateException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  more than one active taskSet for stage [stage]: [TaskSet ids]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: TaskSetManager is considered active when it is not a zombie.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  submitTasks requests the <> to SchedulableBuilder.md#addTaskSetManager[add the TaskSetManager to the schedulable pool].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: The TaskScheduler.md#rootPool[schedulable pool] can be a single flat linked queue (in FIFOSchedulableBuilder.md[FIFO scheduling mode]) or a hierarchy of pools of Schedulables (in FairSchedulableBuilder.md[FAIR scheduling mode]).

submitTasks <> to make sure that the requested resources (i.e. CPUs and memory) are assigned to the Spark application (for a <>, the very first time the Spark application is started, per the <> flag).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: The very first time (<> flag is false) in cluster mode only (i.e. isLocal of the TaskSchedulerImpl is false), starvationTimer is scheduled to execute after configuration-properties.md#spark.starvation.timeout[spark.starvation.timeout] to ensure that the requested resources, i.e. CPUs and memory, were assigned by a cluster manager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: After the first configuration-properties.md#spark.starvation.timeout[spark.starvation.timeout] passes, the <> internal flag is true.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, submitTasks requests the <> to scheduler:SchedulerBackend.md#reviveOffers[reviveOffers].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TIP: Use dag-scheduler-event-loop thread to step through the code in a debugger.
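submitTasks is not called by user code directly; any RDD action reaches it through the DAGScheduler, e.g. (assuming sc is an active SparkContext):

// The action triggers DAGScheduler, which hands one TaskSet per stage to submitTasks;
// a single (result) stage here, so a single TaskSet with 2 tasks.
val result = sc.parallelize(1 to 10, numSlices = 2).map(_ * 2).count()
assert(result == 10)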

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#scheduling-starvation-task","title":"Scheduling Starvation Task

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Every time the starvation timer thread is executed and hasLaunchedTask flag is false, the following WARN message is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources\n

Otherwise, when the hasLaunchedTask flag is true, the timer thread cancels itself.
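One hedged way to provoke the WARN message on a standalone cluster is to request more resources than any worker can offer. The master URL and the deliberately oversized memory value below are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// No worker can grant 1024g per executor, so no resources are accepted and,
// once spark.starvation.timeout passes, the starvation timer logs the WARN message above.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // placeholder standalone master
  .setAppName("starvation-demo")
  .set("spark.executor.memory", "1024g")   // deliberately far above what workers offer
val sc = SparkContext.getOrCreate(conf)
sc.parallelize(1 to 10).count()            // hangs while the WARN message keeps being logged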

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#createTaskSetManager","title":"Creating TaskSetManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskSetManager(\n  taskSet: TaskSet,\n  maxTaskFailures: Int): TaskSetManager\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskSetManager creates a TaskSetManager (with this TaskSchedulerImpl, the given TaskSet and the maxTaskFailures).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createTaskSetManager is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSchedulerImpl is requested to submit a TaskSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#notifying-tasksetmanager-that-task-failed","title":"Notifying TaskSetManager that Task Failed
```scala
handleFailedTask(
  taskSetManager: TaskSetManager,
  tid: Long,
  taskState: TaskState,
  reason: TaskFailedReason): Unit
```

handleFailedTask notifies the given TaskSetManager that the tid task has failed and, only when the TaskSetManager is not in zombie state and the task is not in KILLED state, requests the SchedulerBackend to revive offers.

NOTE: handleFailedTask is called when TaskResultGetter deserializes a TaskFailedReason for a failed task.
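A minimal sketch of that gating, with hypothetical stand-ins (ManagerStub for TaskSetManager, a reviveOffers callback for SchedulerBackend.reviveOffers): offers are only revived when the manager is not a zombie and the failed task was not killed.

```scala
object HandleFailedTaskDemo extends App {
  sealed trait TaskState
  case object FAILED extends TaskState
  case object KILLED extends TaskState

  final class ManagerStub(val isZombie: Boolean) {
    def handleFailedTask(tid: Long, state: TaskState): Unit =
      println(s"manager notified: task $tid ended in $state")
  }

  def handleFailedTask(
      manager: ManagerStub,
      tid: Long,
      taskState: TaskState,
      reviveOffers: () => Unit): Unit = {
    // always tell the manager about the failure first
    manager.handleFailedTask(tid, taskState)
    // revive offers only if more work may still be scheduled for this task set
    if (!manager.isZombie && taskState != KILLED) {
      reviveOffers()
    }
  }

  handleFailedTask(new ManagerStub(isZombie = false), 1L, FAILED, () => println("reviveOffers()"))
  handleFailedTask(new ManagerStub(isZombie = true), 2L, FAILED, () => println("not printed"))
}
```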

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#tasksetfinished","title":"taskSetFinished
```scala
taskSetFinished(
  manager: TaskSetManager): Unit
```

taskSetFinished looks up the TaskSets registered for the stage (in the taskSetsByStageIdAndAttempt registry) and removes the given stage attempt, removing the entire stage record from taskSetsByStageIdAndAttempt when no other attempts are registered.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  taskSetFinished then removes manager from the parent's schedulable pool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  You should see the following INFO message in the logs:

```text
Removed TaskSet [id], whose tasks have all completed, from pool [name]
```
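The registry clean-up above is essentially a nested-map removal. The following self-contained sketch (plain Scala collections; Manager is a hypothetical stand-in for TaskSetManager) shows the attempt being dropped and, once it is the last one, the whole stage entry disappearing from taskSetsByStageIdAndAttempt.

```scala
import scala.collection.mutable

object TaskSetFinishedDemo extends App {
  final case class Manager(stageId: Int, stageAttemptId: Int)

  // stageId -> (stageAttemptId -> manager), as in taskSetsByStageIdAndAttempt
  val taskSetsByStageIdAndAttempt =
    mutable.HashMap.empty[Int, mutable.HashMap[Int, Manager]]

  def register(m: Manager): Unit =
    taskSetsByStageIdAndAttempt
      .getOrElseUpdate(m.stageId, mutable.HashMap.empty)
      .update(m.stageAttemptId, m)

  def taskSetFinished(m: Manager): Unit = {
    taskSetsByStageIdAndAttempt.get(m.stageId).foreach { attempts =>
      attempts.remove(m.stageAttemptId)
      if (attempts.isEmpty) {
        // no other attempts registered: drop the whole stage record
        taskSetsByStageIdAndAttempt.remove(m.stageId)
      }
    }
    println(s"Removed TaskSet ${m.stageId}.${m.stageAttemptId}, whose tasks have all completed, from pool default")
  }

  register(Manager(1, 0))
  register(Manager(1, 1))
  taskSetFinished(Manager(1, 0))
  println(taskSetsByStageIdAndAttempt) // stage 1 still present: attempt 1 remains
}
```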

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  taskSetFinished is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSetManager is requested to maybeFinishTaskSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#notifying-dagscheduler-about-new-executor","title":"Notifying DAGScheduler About New Executor
```scala
executorAdded(
  execId: String,
  host: String)
```

executorAdded simply notifies the DAGScheduler that an executor was added.

NOTE: executorAdded uses the DAGScheduler that was given to this TaskSchedulerImpl earlier.

Creating TaskDescriptions For Available Executor Resource Offers

```scala
resourceOffers(
  offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]]
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOffers takes the resources offers and generates a collection of tasks (as TaskDescriptions) to launch (given the resources available).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A WorkerOffer represents a resource offer with CPU cores free to use on an executor.

Internally, resourceOffers first updates the internal host and executor lookup tables to record new hosts and executors (given the input offers).

For executors that have not been registered yet, resourceOffers notifies the DAGScheduler that an executor was added.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: TaskSchedulerImpl uses resourceOffers to track active executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CAUTION: FIXME a picture with executorAdded call from TaskSchedulerImpl to DAGScheduler.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOffers requests BlacklistTracker to applyBlacklistTimeout and filters out offers on blacklisted nodes and executors.

NOTE: resourceOffers uses the optional BlacklistTracker that was given when TaskSchedulerImpl was created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CAUTION: FIXME Expand on blacklisting

resourceOffers then randomly shuffles the offers (to evenly distribute tasks across executors and avoid over-utilizing some of them) and initializes the local tasks and availableCpus data structures (see the sketch below).
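Taken together, the filtering and set-up steps above amount to the following self-contained sketch; WorkerOffer, cpusPerTask, the exclusion sets and the string task buffers are simplified stand-ins (the real code consults the BlacklistTracker and builds buffers of TaskDescriptions), so treat it as an illustration only.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ResourceOffersSetupDemo extends App {
  final case class WorkerOffer(executorId: String, host: String, cores: Int)

  val cpusPerTask = 1                 // stands in for spark.task.cpus
  val excludedNodes = Set("host-2")   // what the blacklist tracker would report
  val excludedExecutors = Set("exec-3")

  val offers = Seq(
    WorkerOffer("exec-1", "host-1", 4),
    WorkerOffer("exec-2", "host-2", 4), // dropped: excluded node
    WorkerOffer("exec-3", "host-3", 2), // dropped: excluded executor
    WorkerOffer("exec-4", "host-4", 2))

  // 1) filter out offers on excluded (blacklisted) nodes and executors
  val filteredOffers = offers.filterNot { o =>
    excludedNodes.contains(o.host) || excludedExecutors.contains(o.executorId)
  }

  // 2) shuffle the remaining offers to spread tasks across executors
  val shuffledOffers = Random.shuffle(filteredOffers)

  // 3) one task buffer per offer plus the per-offer free-CPU counters
  val tasks = shuffledOffers.map(o => new ArrayBuffer[String](o.cores / cpusPerTask))
  val availableCpus = shuffledOffers.map(_.cores).toArray

  println(shuffledOffers.map(_.executorId))
  println(availableCpus.toSeq)
}
```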

resourceOffers takes TaskSets in scheduling order (getSortedTaskSetQueue) from the top-level Schedulable Pool (rootPool).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

rootPool is configured when TaskSchedulerImpl is initialized.

rootPool is part of the TaskScheduler contract and is exclusively managed by SchedulableBuilders, i.e. FIFOSchedulableBuilder and FairSchedulableBuilder (which register TaskSetManagers with the root pool).

A TaskSetManager manages execution of the tasks in a single TaskSet that represents a single Stage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  For every TaskSetManager (in scheduling order), you should see the following DEBUG message in the logs:

```text
parentName: [name], name: [name], runningTasks: [count]
```

Only when a new executor was added does resourceOffers notify every TaskSetManager about the change (to recompute locality preferences).

resourceOffers then takes every TaskSetManager (in scheduling order) and offers it the available resources, in increasing order of locality levels (per the TaskSetManager's valid locality levels).

NOTE: A TaskSetManager computes the locality levels of the tasks it manages.

For every TaskSetManager and each of its valid locality levels, resourceOffers keeps finding tasks to schedule on executors (see resourceOfferSingleTaskSet below) for as long as the TaskSetManager manages to launch a task (given the locality level).

If resourceOffers did not manage to offer resources to a TaskSetManager so it could launch any task, resourceOffers requests the TaskSetManager to abort the TaskSet if it is completely blacklisted.

When resourceOffers managed to launch a task, the internal hasLaunchedTask flag gets enabled (which effectively means what the name says: "there were executors and I managed to launch a task").
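The overall loop can be compressed into the following self-contained sketch; TaskSetLike and its offer method are hypothetical stand-ins for TaskSetManager and resourceOfferSingleTaskSet, so this only illustrates the control flow (scheduling order, locality levels, keep-offering-while-launching), not the real bookkeeping.

```scala
object SchedulingLoopDemo extends App {
  sealed trait Locality
  case object ProcessLocal extends Locality
  case object NodeLocal extends Locality
  case object AnyLocality extends Locality

  trait TaskSetLike {
    def validLocalityLevels: Seq[Locality]
    /** True if at least one task was launched at this locality level (resourceOfferSingleTaskSet's role). */
    def offer(maxLocality: Locality): Boolean
  }

  def resourceOffers(sortedTaskSets: Seq[TaskSetLike]): Boolean = {
    var launchedAnyTask = false
    for (taskSet <- sortedTaskSets; maxLocality <- taskSet.validLocalityLevels) {
      var launched = true
      while (launched) {                       // keep offering while tasks get launched
        launched = taskSet.offer(maxLocality)
        launchedAnyTask |= launched
      }
    }
    launchedAnyTask                            // feeds the hasLaunchedTask flag
  }

  // a task set with two tasks that are happy to run anywhere
  val taskSet = new TaskSetLike {
    private var remaining = 2
    val validLocalityLevels = Seq(ProcessLocal, NodeLocal, AnyLocality)
    def offer(maxLocality: Locality): Boolean =
      if (remaining > 0) { remaining -= 1; println(s"launched a task at $maxLocality"); true }
      else false
  }
  println(s"launched any task: ${resourceOffers(Seq(taskSet))}")
}
```

In the real scheduler the inner call also honours per-offer CPU and resource availability; that part is sketched separately for resourceOfferSingleTaskSet below.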

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOffers is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • CoarseGrainedSchedulerBackend (via DriverEndpoint RPC endpoint) is requested to make executor resource offers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • LocalEndpoint is requested to revive resource offers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#maybeinitbarriercoordinator","title":"maybeInitBarrierCoordinator
```scala
maybeInitBarrierCoordinator(): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Unless a BarrierCoordinator has already been registered, maybeInitBarrierCoordinator creates a BarrierCoordinator and registers it to be known as barrierSync.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, maybeInitBarrierCoordinator prints out the following INFO message to the logs:

```text
Registered BarrierCoordinator endpoint
```
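As a rough model of the register-at-most-once behaviour (not Spark's RPC machinery; RegisterOnce and its members are hypothetical), consider:

```scala
object MaybeInitDemo extends App {
  // Hypothetical guard: create and register the coordinator only on the first call.
  final class RegisterOnce[A](create: () => A) {
    private var registered: Option[A] = None
    def maybeInit(): A = registered match {
      case Some(existing) => existing
      case None =>
        val created = create()
        registered = Some(created)
        println("Registered BarrierCoordinator endpoint")
        created
    }
  }

  val barrierCoordinator = new RegisterOnce(() => "barrierSync endpoint")
  barrierCoordinator.maybeInit() // creates, registers and logs
  barrierCoordinator.maybeInit() // already registered: nothing happens
}
```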
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#resourceOfferSingleTaskSet","title":"Finding Tasks from TaskSetManager to Schedule on Executors
```scala
resourceOfferSingleTaskSet(
  taskSet: TaskSetManager,
  maxLocality: TaskLocality,
  shuffledOffers: Seq[WorkerOffer],
  availableCpus: Array[Int],
  availableResources: Array[Map[String, Buffer[String]]],
  tasks: IndexedSeq[ArrayBuffer[TaskDescription]]): (Boolean, Option[TaskLocality])
```

resourceOfferSingleTaskSet takes every WorkerOffer (from the input shuffledOffers) and, only if the number of available CPU cores (per the input availableCpus) is at least spark.task.cpus, requests the TaskSetManager (the input taskSet) to find a Task to execute given the resource offer (the executor, the host, and the input maxLocality).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOfferSingleTaskSet adds the task to the input tasks collection.

resourceOfferSingleTaskSet records the task id and the TaskSetManager in the internal registries (such as taskIdToTaskSetManager and taskIdToExecutorId).

resourceOfferSingleTaskSet decreases the available CPU count for that WorkerOffer (in the input availableCpus) by spark.task.cpus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOfferSingleTaskSet returns whether a task was launched or not.
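The per-offer loop can be condensed into the following self-contained sketch; WorkerOffer, findTask and the string task ids are simplified stand-ins (the real method works with TaskDescriptions and also tracks other resources), so treat it as an illustration of the CPU bookkeeping rather than the actual implementation.

```scala
import scala.collection.mutable.ArrayBuffer

object SingleTaskSetDemo extends App {
  final case class WorkerOffer(executorId: String, host: String, cores: Int)

  val cpusPerTask = 2 // stands in for spark.task.cpus

  // findTask plays the role of TaskSetManager.resourceOffer: given an executor and host,
  // it may or may not hand back a task to run there.
  def resourceOfferSingleTaskSet(
      findTask: (String, String) => Option[String],
      shuffledOffers: Seq[WorkerOffer],
      availableCpus: Array[Int],
      tasks: IndexedSeq[ArrayBuffer[String]]): Boolean = {
    var launchedTask = false
    for (i <- shuffledOffers.indices) {
      val offer = shuffledOffers(i)
      if (availableCpus(i) >= cpusPerTask) {      // enough free CPUs on this offer?
        findTask(offer.executorId, offer.host).foreach { taskId =>
          tasks(i) += taskId                      // record the task for this offer
          availableCpus(i) -= cpusPerTask         // charge spark.task.cpus to the offer
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      }
    }
    launchedTask
  }

  val offers = Seq(WorkerOffer("exec-1", "host-1", 4), WorkerOffer("exec-2", "host-2", 1))
  val cpus = offers.map(_.cores).toArray
  val buffers = IndexedSeq.fill(offers.size)(new ArrayBuffer[String])
  var counter = 0
  val launched = resourceOfferSingleTaskSet(
    (exec, _) => { counter += 1; Some(s"task-$counter on $exec") },
    offers, cpus, buffers)
  println((launched, buffers, cpus.toSeq))
}
```

The real method's result also carries an Option[TaskLocality] (as in the signature above); the sketch keeps just the launched-or-not part.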

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOfferSingleTaskSet asserts that the number of available CPU cores (in the input availableCpus per WorkerOffer) is at least 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If there is a TaskNotSerializableException, resourceOfferSingleTaskSet prints out the following ERROR in the logs:

```text
Resource offer failed, task set [name] was not serializable
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  resourceOfferSingleTaskSet is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSchedulerImpl is requested to resourceOffers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#TaskLocality","title":"Task Locality Preference

TaskLocality represents a task locality preference and can be one of the following, from the most localized to the widest (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1. PROCESS_LOCAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  2. NODE_LOCAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  3. NO_PREF
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  4. RACK_LOCAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  5. ANY
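A minimal sketch of this ordering, modelled as a Scala Enumeration so that declaration order doubles as preference order; the isAllowed helper is an illustrative assumption showing how such an ordering makes "at most this wide" checks a simple comparison.

```scala
object TaskLocalityDemo extends App {
  object TaskLocality extends Enumeration {
    // declaration order gives PROCESS_LOCAL the smallest id, i.e. the highest preference
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  }
  import TaskLocality._

  // allowed levels are everything up to (and including) the given maximum locality
  def isAllowed(constraint: TaskLocality.Value, candidate: TaskLocality.Value): Boolean =
    candidate <= constraint

  println(isAllowed(constraint = NODE_LOCAL, candidate = PROCESS_LOCAL)) // true
  println(isAllowed(constraint = NODE_LOCAL, candidate = RACK_LOCAL))    // false
}
```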
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#workeroffer-free-cpu-cores-on-executor","title":"WorkerOffer \u2014 Free CPU Cores on Executor
```scala
WorkerOffer(
  executorId: String,
  host: String,
  cores: Int)
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  WorkerOffer represents a resource offer with free CPU cores available on an executorId executor on a host.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#workerremoved","title":"workerRemoved
```scala
workerRemoved(
  workerId: String,
  host: String,
  message: String): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  workerRemoved is part of the TaskScheduler abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  workerRemoved prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Handle removed worker [workerId]: [message]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, workerRemoved requests the DAGScheduler to workerRemoved.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#calculateAvailableSlots","title":"calculateAvailableSlots
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  calculateAvailableSlots(\n  scheduler: TaskSchedulerImpl,\n  conf: SparkConf,\n  rpId: Int,\n  availableRPIds: Array[Int],\n  availableCpus: Array[Int],\n  availableResources: Array[Map[String, Int]]): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  calculateAvailableSlots...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  calculateAvailableSlots is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskSchedulerImpl is requested for TaskDescriptions for the given executor resource offers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • CoarseGrainedSchedulerBackend is requested for the maximum number of concurrent tasks
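Purely as an illustration (not Spark's actual implementation, which is not covered above yet), counting slots from the free CPU cores of the offers that match the given resource profile could look like the following sketch; it assumes spark.task.cpus is the per-task CPU demand and ignores custom resources:

// Illustrative sketch only: one slot per spark.task.cpus free cores, on offers with the given rpId\nval cpusPerTask = conf.get(\"spark.task.cpus\", \"1\").toInt\nval availableSlots = availableRPIds.indices\n  .filter(i => availableRPIds(i) == rpId)\n  .map(i => availableCpus(i) / cpusPerTask)\n  .sum\n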
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSchedulerImpl/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.scheduler.TaskSchedulerImpl logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  logger.TaskSchedulerImpl.name = org.apache.spark.scheduler.TaskSchedulerImpl\nlogger.TaskSchedulerImpl.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"scheduler/TaskSet/","title":"TaskSet","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSet is a collection of independent tasks of a stage (and a stage execution attempt) that are missing (uncomputed), i.e. for which computation results are unavailable (as RDD blocks on BlockManagers on executors).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In other words, a TaskSet represents the missing partitions of a stage that (as tasks) can be run right away based on the data that is already on the cluster, e.g. map output files from previous stages, though they may fail if this data becomes unavailable.

Since a TaskSet contains only the missing tasks, its size does not necessarily equal the total number of tasks of the stage. For a brand new stage (one that has never been attempted) the two numbers are exactly the same.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Once DAGScheduler submits the missing tasks for execution (to the TaskScheduler), the execution of the TaskSet is managed by a TaskSetManager that allows for spark.task.maxFailures.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"scheduler/TaskSet/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskSet takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stage ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stage (Execution) Attempt ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • FIFO Priority
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Local Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Resource Profile ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskSet is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DAGScheduler is requested to submit the missing tasks of a stage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskSet/#id","title":"ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    id: String\n

TaskSet is uniquely identified by an id that is the stageId followed by the stageAttemptId with a dot (.) in between:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    [stageId].[stageAttemptId]\n
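In other words (a one-line sketch, assuming plain string concatenation):

// e.g. stageId = 3 and stageAttemptId = 0 give \"3.0\"\nval id: String = s\"$stageId.$stageAttemptId\"\n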
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskSet/#textual-representation","title":"Textual Representation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    toString: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    toString follows the pattern:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskSet [stageId].[stageAttemptId]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskSet/#task-scheduling-prioritization-fifo-scheduling","title":"Task Scheduling Prioritization (FIFO Scheduling)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskSet is given a priority when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The priority is the ID of the earliest-created active job that needs the stage (that is given when DAGScheduler is requested to submit the missing tasks of a stage).

Once the TaskSet is submitted for execution, its priority becomes the priority of the TaskSetManager (which is a Schedulable) and is used for task prioritization (i.e., deciding the order in which tasks are scheduled) in the FIFO scheduling mode.
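A simplified sketch of the FIFO ordering idea (in the spirit of Spark's FIFOSchedulingAlgorithm: compare priorities, i.e. job IDs, and break ties with stage IDs; the helper below is not Spark code):

// s1 goes before s2 if it belongs to an earlier job or, within the same job, to an earlier stage\ndef fifoBefore(s1: Schedulable, s2: Schedulable): Boolean = {\n  if (s1.priority != s2.priority) s1.priority < s2.priority\n  else s1.stageId < s2.stageId\n}\n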

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"scheduler/TaskSetBlacklist/","title":"TaskSetBlacklist","text":"

TaskSetBlacklist -- Blacklisting Executors and Nodes For TaskSet

CAUTION: FIXME

updateBlacklistForFailedTask Method

CAUTION: FIXME

isExecutorBlacklistedForTaskSet Method

CAUTION: FIXME

isNodeBlacklistedForTaskSet Method

CAUTION: FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskSetManager/","title":"TaskSetManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskSetManager is a Schedulable that manages scheduling the tasks of a TaskSet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"scheduler/TaskSetManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskSetManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskSchedulerImpl
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Number of Task Failures
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • HealthTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Clock

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskSetManager is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to create a TaskSetManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      While being created, TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks in the taskset.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskSetManager uses TaskSchedulerImpl to access the current MapOutputTracker.

TaskSetManager prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Epoch for [taskSet]: [epoch]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskSetManager adds the tasks as pending execution (in reverse order from the highest partition to the lowest).
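A simplified sketch of the construction steps just described (not verbatim Spark code; the names follow the description above):

// propagate the current shuffle epoch to every task, then register pending tasks\nval epoch = sched.mapOutputTracker.getEpoch\nlogDebug(s\"Epoch for $taskSet: $epoch\")\ntasks.foreach(_.epoch = epoch)\n// reverse order, so that tasks with lower indices are dequeued (and launched) first\nfor (i <- (0 until numTasks).reverse) {\n  addPendingTask(i)\n}\n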

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#maxTaskFailures","title":"Number of Task Failures","text":"

TaskSetManager is given a maxTaskFailures value that is the number of times a single task can fail before the whole TaskSet is aborted.
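On a cluster manager, the value comes from the spark.task.maxFailures configuration property (see the mapping below). A minimal sketch of setting it explicitly (the value 8 is only an example):

import org.apache.spark.SparkConf\n\n// allow up to 8 failures of any single task before the TaskSet is aborted\nval conf = new SparkConf().set(\"spark.task.maxFailures\", \"8\")\n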

Number of Task Failures per Master URL:

• local: 1
• local-with-retries: maxFailures
• local-cluster: spark.task.maxFailures
• Cluster Manager: spark.task.maxFailures
"},{"location":"scheduler/TaskSetManager/#isBarrier","title":"isBarrier","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      isBarrier: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      isBarrier is enabled (true) when this TaskSetManager is created for a TaskSet with barrier tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      isBarrier is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet, resourceOffers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSetManager is requested to resourceOffer, checkSpeculatableTasks, getLocalityWait
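For reference, a TaskSet with barrier tasks typically originates from the barrier execution mode API, e.g. (a sketch; sc is assumed to be an active SparkContext):

import org.apache.spark.BarrierTaskContext\n\n// all 4 tasks of this stage are launched together and can coordinate via BarrierTaskContext\nval rdd = sc.parallelize(1 to 100, numSlices = 4)\n  .barrier()\n  .mapPartitions { it =>\n    BarrierTaskContext.get().barrier()  // global synchronization point\n    it\n  }\n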
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#resourceOffer","title":"resourceOffer","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      resourceOffer(\n  execId: String,\n  host: String,\n  maxLocality: TaskLocality.TaskLocality,\n  taskCpus: Int = sched.CPUS_PER_TASK,\n  taskResourceAssignments: Map[String, ResourceInformation] = Map.empty): (Option[TaskDescription], Boolean, Int)\n

resourceOffer determines the allowed locality level for the given TaskLocality (for any TaskLocality other than NO_PREF).

resourceOffer dequeues a task (dequeueTask) for the given execId and host and the allowed locality level, which may or may not produce a TaskDescription.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, resourceOffer returns the TaskDescription, hasScheduleDelayReject, and the index of the dequeued task (if any).

resourceOffer returns a (None, false, -1) tuple when this TaskSetManager is a zombie (isZombie) or the offer (from the given host or execId) should be ignored (excluded).
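For illustration only, a self-contained sketch of that decision flow. The names below are stand-ins for the private Spark internals (including a local TaskDescription), and the delay-scheduling rejection flag is simplified to always be false:

```scala
// Simplified sketch of resourceOffer's result shape; not the actual Spark implementation.
case class TaskDescription(taskId: Long, index: Int)

def resourceOfferSketch(
    isZombie: Boolean,
    offerExcluded: Boolean,                    // host/executor excluded for this task set
    dequeueTask: () => Option[TaskDescription] // stands in for dequeueTask(execId, host, locality)
  ): (Option[TaskDescription], Boolean, Int) = {
  if (isZombie || offerExcluded) {
    (None, false, -1)                          // nothing can be launched for this offer
  } else {
    dequeueTask() match {
      case Some(task) => (Some(task), false, task.index) // a task was dequeued for the offer
      case None       => (None, false, -1)               // no task matched the allowed locality
    }
  }
}
```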

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      resourceOffer is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#getLocalityWait","title":"Locality Wait","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocalityWait(\n  level: TaskLocality.TaskLocality): Long\n

getLocalityWait is 0 when both the legacyLocalityWaitReset and isBarrier flags are enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocalityWait determines the value of locality wait based on the given TaskLocality.TaskLocality.

| TaskLocality | Configuration Property |
|---------------|------------------------|
| PROCESS_LOCAL | spark.locality.wait.process |
| NODE_LOCAL | spark.locality.wait.node |
| RACK_LOCAL | spark.locality.wait.rack |

If the value cannot be determined, getLocalityWait defaults to 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NO_PREF and ANY task localities have no locality wait.
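These waits are regular Spark configuration properties, so they can be tuned when building the SparkConf. The values below are only an illustration; spark.locality.wait is the base default for the per-level properties:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "3s")          // base wait (default for the per-level settings)
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL
```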

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getLocalityWait is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSetManager is created and recomputes locality preferences
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#maxResultSize","title":"spark.driver.maxResultSize","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskSetManager uses spark.driver.maxResultSize configuration property to check available memory for more task results.
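For example, the limit can be raised (or disabled with 0) before the SparkContext is created; a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Allow up to 2g of serialized results to be collected to the driver
// (the default is 1g; 0 means unlimited).
val conf = new SparkConf()
  .setAppName("max-result-size-demo")
  .set("spark.driver.maxResultSize", "2g")
val sc = SparkContext.getOrCreate(conf)
```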

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#recomputeLocality","title":"Recomputing Task Locality Preferences","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recomputeLocality(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If zombie, recomputeLocality does nothing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recomputeLocality recomputes myLocalityLevels, localityWaits and currentLocalityIndex internal registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recomputeLocality computes locality levels (for scheduled tasks) and saves the result in myLocalityLevels internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recomputeLocality computes localityWaits by determining the locality wait for every locality level in myLocalityLevels.

recomputeLocality computes currentLocalityIndex using getLocalityIndex with the previous locality level. If the new currentLocalityIndex is higher than the previous one, recomputeLocality recalculates currentLocalityIndex.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recomputeLocality is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSetManager is notified about status change in executors (i.e., lost, decommissioned, added)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#zombie","title":"Zombie","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      A TaskSetManager is a zombie when all tasks in a taskset have completed successfully (regardless of the number of task attempts), or if the taskset has been aborted.

While in the zombie state, a TaskSetManager launches no new tasks and responds with no TaskDescriptions to resourceOffers.

A TaskSetManager remains in the zombie state until all tasks have finished running, so that it can continue to track and account for the running tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#computeValidLocalityLevels","title":"Computing Locality Levels (for Scheduled Tasks)","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      computeValidLocalityLevels(): Array[TaskLocality.TaskLocality]\n

computeValidLocalityLevels computes the valid locality levels for the pending tasks that are registered in the corresponding per-locality-level registries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Note

TaskLocality is the locality preference of a task and ranges from the most localized PROCESS_LOCAL and NODE_LOCAL, through NO_PREF and RACK_LOCAL, to ANY.

For the pending tasks (in the pendingTasks registry), computeValidLocalityLevels requests the TaskSchedulerImpl for acceptable TaskLocalities:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • For every executor, computeValidLocalityLevels requests the TaskSchedulerImpl to isExecutorAlive and adds PROCESS_LOCAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • For every host, computeValidLocalityLevels requests the TaskSchedulerImpl to hasExecutorsAliveOnHost and adds NODE_LOCAL
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • For any pending tasks with no locality preference, computeValidLocalityLevels adds NO_PREF
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • For every rack, computeValidLocalityLevels requests the TaskSchedulerImpl to hasHostAliveOnRack and adds RACK_LOCAL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      computeValidLocalityLevels always registers ANY task locality level.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, computeValidLocalityLevels prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Valid locality levels for [taskSet]: [comma-separated levels]\n
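The accumulation of levels described above can be sketched as follows. This is a self-contained illustration; the parameter names stand in for the private pending-task registries and TaskSchedulerImpl callbacks and are not the actual Spark members:

```scala
object LocalityLevelsSketch {
  sealed trait TaskLocality
  case object PROCESS_LOCAL extends TaskLocality
  case object NODE_LOCAL    extends TaskLocality
  case object NO_PREF       extends TaskLocality
  case object RACK_LOCAL    extends TaskLocality
  case object ANY           extends TaskLocality

  def computeValidLocalityLevels(
      executorsWithPendingTasks: Set[String],   // keys of the per-executor registry
      hostsWithPendingTasks: Set[String],       // keys of the per-host registry
      racksWithPendingTasks: Set[String],       // keys of the per-rack registry
      hasTasksWithNoPrefs: Boolean,             // any pending tasks with no locality preference
      isExecutorAlive: String => Boolean,
      hasExecutorsAliveOnHost: String => Boolean,
      hasHostAliveOnRack: String => Boolean): Seq[TaskLocality] = {
    val levels = Seq.newBuilder[TaskLocality]
    if (executorsWithPendingTasks.exists(isExecutorAlive))     levels += PROCESS_LOCAL
    if (hostsWithPendingTasks.exists(hasExecutorsAliveOnHost)) levels += NODE_LOCAL
    if (hasTasksWithNoPrefs)                                   levels += NO_PREF
    if (racksWithPendingTasks.exists(hasHostAliveOnRack))      levels += RACK_LOCAL
    levels += ANY                                              // ANY is always registered
    levels.result()
  }
}
```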

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      computeValidLocalityLevels is used when:

• TaskSetManager is created and is requested to recomputeLocality
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#executorAdded","title":"executorAdded","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorAdded(): Unit\n

executorAdded simply recomputes locality preferences (recomputeLocality).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      executorAdded is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to handle resource offers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#prepareLaunchingTask","title":"prepareLaunchingTask","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      prepareLaunchingTask(\n  execId: String,\n  host: String,\n  index: Int,\n  taskLocality: TaskLocality.Value,\n  speculative: Boolean,\n  taskCpus: Int,\n  taskResourceAssignments: Map[String, ResourceInformation],\n  launchTime: Long): TaskDescription\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      taskResourceAssignments

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      taskResourceAssignments are the resources that are passed in to resourceOffer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      prepareLaunchingTask...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      prepareLaunchingTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • TaskSchedulerImpl is requested to resourceOffers
• TaskSetManager is requested to resourceOffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#TASK_SIZE_TO_WARN_KIB","title":"Serialized Task Size Threshold","text":"

TaskSetManager object defines the TASK_SIZE_TO_WARN_KIB value as the threshold to warn a user if any stage contains a task with a serialized size greater than 1000 KiB.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#TASK_SIZE_TO_WARN_KIB-DAGScheduler","title":"DAGScheduler","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DAGScheduler can print out the following WARN message to the logs when requested to submitMissingTasks:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Broadcasting large task binary with size [taskBinaryBytes] [siByteSuffix]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#TASK_SIZE_TO_WARN_KIB-TaskSetManager","title":"TaskSetManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      TaskSetManager can print out the following WARN message to the logs when requested to prepareLaunchingTask:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Stage [stageId] contains a task of very large size ([serializedTask] KiB).\nThe maximum recommended task size is 1000 KiB.\n
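A minimal sketch of that check; the constant value comes from the section above, and the println stands in for Spark's internal warning logging:

```scala
// Warn when a serialized task exceeds the recommended size; illustration only.
val TASK_SIZE_TO_WARN_KIB = 1000

def warnIfTaskTooLarge(stageId: Int, serializedTaskBytes: Long): Unit = {
  val sizeKiB = serializedTaskBytes / 1024
  if (sizeKiB > TASK_SIZE_TO_WARN_KIB) {
    println(
      s"Stage $stageId contains a task of very large size ($sizeKiB KiB). " +
        s"The maximum recommended task size is $TASK_SIZE_TO_WARN_KIB KiB.")
  }
}
```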
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#demo","title":"Demo","text":"

Enable DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl (or org.apache.spark.scheduler.cluster.YarnScheduler for YARN) and org.apache.spark.scheduler.TaskSetManager and execute the following two-stage job to see their low-level inner workings.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      A cluster manager is recommended since it gives more task localization choices (with YARN additionally supporting rack localization).

$ ./bin/spark-shell \
    --master yarn \
    --conf spark.ui.showConsoleProgress=false

// Keep # partitions low to keep # messages low

scala> sc.parallelize(0 to 9, 3).groupBy(_ % 3).count
INFO YarnScheduler: Adding task set 0.0 with 3 tasks
DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.0.2.87, executor 1, partition 0, PROCESS_LOCAL, 7541 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.0.2.87, executor 2, partition 1, PROCESS_LOCAL, 7541 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.0.2.87, executor 1, partition 2, PROCESS_LOCAL, 7598 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1
DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 518 ms on 10.0.2.87 (executor 1) (1/3)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 512 ms on 10.0.2.87 (executor 2) (2/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0
INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51 ms on 10.0.2.87 (executor 1) (3/3)
INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO YarnScheduler: Adding task set 1.0 with 3 tasks
DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 10.0.2.87, executor 2, partition 0, NODE_LOCAL, 7348 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 10.0.2.87, executor 1, partition 1, NODE_LOCAL, 7348 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1
INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 10.0.2.87, executor 1, partition 2, NODE_LOCAL, 7348 bytes)
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 130 ms on 10.0.2.87 (executor 1) (1/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1
DEBUG TaskSetManager: No tasks for locality level NODE_LOCAL, so moving to locality level RACK_LOCAL
DEBUG TaskSetManager: No tasks for locality level RACK_LOCAL, so moving to locality level ANY
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 133 ms on 10.0.2.87 (executor 2) (2/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0
INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 21 ms on 10.0.2.87 (executor 1) (3/3)
INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
res0: Long = 3
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"scheduler/TaskSetManager/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.scheduler.TaskSetManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.scheduler.TaskSetManager=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/","title":"Serialization System","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Serialization System is a core component of Apache Spark with pluggable serializers for task closures and block data.

Serialization System uses SerializerManager to select the Serializer (based on the spark.serializer configuration property).
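For example, a Spark application can opt into Kryo by setting spark.serializer before the SparkContext is created. A minimal sketch (the app name and master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// SerializerManager reads spark.serializer to decide which Serializer to use.
val conf = new SparkConf()
  .setAppName("serializer-demo") // placeholder
  .setMaster("local[*]")         // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = SparkContext.getOrCreate(conf)
```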

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/","title":"DeserializationStream","text":"


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DeserializationStream is an abstraction of streams for reading serialized objects.
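As a rough usage sketch (assuming a running Spark application so that SparkEnv is initialized), a DeserializationStream usually comes from SerializerInstance.deserializeStream and is drained with asIterator:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkEnv

// Round-trip a few records through the configured serializer.
val instance = SparkEnv.get.serializer.newInstance()

val bytes = new ByteArrayOutputStream()
val out = instance.serializeStream(bytes)
out.writeObject("one").writeObject("two").close()

// DeserializationStream reads the objects back in write order.
val in = instance.deserializeStream(new ByteArrayInputStream(bytes.toByteArray))
val restored = in.asIterator.toList // List(one, two)
in.close()
```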

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[readObject]] readObject Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/#source-scala","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readobjectt-classtag-t","title":"readObjectT: ClassTag: T","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      readObject...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      readObject is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[readKey]] readKey Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/#source-scala_1","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readkeyt-classtag-t","title":"readKeyT: ClassTag: T","text":"

readKey reads the object representing the key of a key-value record.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      readKey is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[readValue]] readValue Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/#source-scala_2","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#readvaluet-classtag-t","title":"readValueT: ClassTag: T","text":"

readValue reads the object representing the value of a key-value record.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      readValue is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[asIterator]] asIterator Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/#source-scala_3","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#asiterator-iteratorany","title":"asIterator: Iterator[Any]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      asIterator...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      asIterator is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[asKeyValueIterator]] asKeyValueIterator Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/DeserializationStream/#source-scala_4","title":"[source, scala]","text":""},{"location":"serializer/DeserializationStream/#askeyvalueiterator-iteratorany","title":"asKeyValueIterator: Iterator[Any]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      asKeyValueIterator...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      asKeyValueIterator is used when...FIXME
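A hedged sketch of how asKeyValueIterator pairs with writeKey and writeValue on the serialization side (again assuming SparkEnv is available):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkEnv

val instance = SparkEnv.get.serializer.newInstance()

// Write key-value records with the companion SerializationStream...
val bytes = new ByteArrayOutputStream()
val out = instance.serializeStream(bytes)
out.writeKey("a").writeValue(1)
out.writeKey("b").writeValue(2)
out.close()

// ...and read them back as (key, value) pairs.
val in = instance.deserializeStream(new ByteArrayInputStream(bytes.toByteArray))
val pairs = in.asKeyValueIterator.toList // List((a,1), (b,2))
in.close()
```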

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/JavaSerializerInstance/","title":"JavaSerializerInstance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      JavaSerializerInstance is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/KryoSerializer/","title":"KryoSerializer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      KryoSerializer is a Serializer that uses the Kryo serialization library.
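KryoSerializer is usually enabled and tuned through SparkConf. A minimal sketch (Point is a placeholder user class):

```scala
import org.apache.spark.SparkConf

// Placeholder user class to register with Kryo.
case class Point(x: Double, y: Double)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registerKryoClasses records the class names under spark.kryo.classesToRegister.
  .registerKryoClasses(Array(classOf[Point]))
```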

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"serializer/KryoSerializer/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      KryoSerializer takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf

KryoSerializer is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SerializerManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkConf is requested to registerKryoClasses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SerializerSupport (Spark SQL) is requested for a SerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"serializer/KryoSerializer/#useunsafe-flag","title":"useUnsafe Flag

KryoSerializer uses the spark.kryo.unsafe configuration property for the useUnsafe flag (initialized when KryoSerializer is created).
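For instance, a configuration sketch that turns on the unsafe-based IO (spark.kryo.unsafe defaults to false):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Makes KryoSerializer use Kryo's unsafe-based Input/Output streams.
  .set("spark.kryo.unsafe", "true")
```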

useUnsafe is used when KryoSerializer is requested to create the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#creating-new-serializerinstance","title":"Creating New SerializerInstance
newInstance(): SerializerInstance

newInstance is part of the Serializer abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        newInstance creates a KryoSerializerInstance with this KryoSerializer (and the useUnsafe and usePool flags).
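A hedged round-trip sketch with a directly created KryoSerializer (the serialized values are arbitrary placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// newInstance gives a KryoSerializerInstance for (de)serializing values.
val kryo = new KryoSerializer(new SparkConf())
val instance = kryo.newInstance()

val buffer = instance.serialize(Seq(1, 2, 3))
val restored = instance.deserialize[Seq[Int]](buffer)
```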

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#newkryooutput","title":"newKryoOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        newKryoOutput(): KryoOutput\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        newKryoOutput...FIXME

newKryoOutput is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializerInstance is requested for the output
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#newkryo","title":"newKryo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        newKryo(): Kryo\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        newKryo...FIXME

newKryo is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializer is requested for a KryoFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializerInstance is requested to borrowKryo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#kryofactory","title":"KryoFactory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        factory: KryoFactory\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        KryoSerializer creates a KryoFactory lazily (on demand and once only) for internalPool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#kryopool","title":"KryoPool

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        KryoSerializer creates a custom KryoPool lazily (on demand and once only).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        KryoPool is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • pool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • setDefaultClassLoader
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializer/#supportsrelocationofserializedobjects","title":"supportsRelocationOfSerializedObjects
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        supportsRelocationOfSerializedObjects: Boolean\n

supportsRelocationOfSerializedObjects is part of the Serializer abstraction.

supportsRelocationOfSerializedObjects creates a new SerializerInstance (assumed to be a KryoSerializerInstance) and requests it for the value of the autoReset field.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"serializer/KryoSerializerInstance/","title":"KryoSerializerInstance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        KryoSerializerInstance is a SerializerInstance.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"serializer/KryoSerializerInstance/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        KryoSerializerInstance takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KryoSerializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • useUnsafe flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • usePool flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          KryoSerializerInstance is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KryoSerializer is requested for a new SerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/KryoSerializerInstance/#output","title":"Output

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          KryoSerializerInstance creates Kryo's Output lazily (on demand and once only).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          KryoSerializerInstance requests the KryoSerializer for a newKryoOutput.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          output\u00a0is used for serialization.
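
A minimal sketch of the "lazily, on demand and once only" behaviour, assuming a newKryoOutput-like factory (hypothetical names, not the Spark source):

```scala
import com.esotericsoftware.kryo.io.Output

// A Scala lazy val gives the create-on-first-use, create-once semantics.
class KryoOutputHolder(newKryoOutput: () => Output) {
  private lazy val output: Output = newKryoOutput()  // not created until first used

  def withOutput[A](f: Output => A): A = f(output)   // every call reuses the same Output
}
```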

serialize

serialize[T: ClassTag](
  t: T): ByteBuffer

serialize is part of the SerializerInstance abstraction.

serialize...FIXME
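
Until the above is filled in, here is a hedged sketch of the typical Kryo-based write path (helper names such as borrowKryo, releaseKryo and output follow the sections of this page and are assumptions, not quoted source):

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Hedged sketch: clear the shared Output, borrow a Kryo instance, write the
// class and the object, release the Kryo instance, and wrap the bytes.
def serializeSketch[T: ClassTag](
    t: T,
    borrowKryo: () => Kryo,
    releaseKryo: Kryo => Unit,
    output: Output): ByteBuffer = {
  output.clear()                         // reuse the buffer from position 0
  val kryo = borrowKryo()
  try {
    kryo.writeClassAndObject(output, t)  // class identifier first, then the object
  } finally {
    releaseKryo(kryo)                    // see Releasing Kryo Instance below
  }
  ByteBuffer.wrap(output.toBytes)
}
```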

deserialize

deserialize[T: ClassTag](
  bytes: ByteBuffer): T

deserialize is part of the SerializerInstance abstraction.

deserialize...FIXME
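
A hedged sketch of the corresponding read path (again with assumed helper names, and assuming an array-backed ByteBuffer; not quoted source):

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input

// Hedged sketch: borrow a Kryo instance, wrap the buffer in a Kryo Input,
// read the class and the object back, and release the Kryo instance.
def deserializeSketch[T: ClassTag](
    bytes: ByteBuffer,
    borrowKryo: () => Kryo,
    releaseKryo: Kryo => Unit): T = {
  val kryo = borrowKryo()
  try {
    // assumes bytes.hasArray; a non-array-backed buffer would need a stream-based Input
    val input = new Input(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining())
    kryo.readClassAndObject(input).asInstanceOf[T]
  } finally {
    releaseKryo(kryo)
  }
}
```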

Releasing Kryo Instance

releaseKryo(
  kryo: Kryo): Unit

releaseKryo...FIXME
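
A hedged sketch of what releasing plausibly involves, given the usePool flag from Creating Instance above (names and pooling details are assumptions, not quoted source):

```scala
import com.esotericsoftware.kryo.Kryo

// Hedged sketch: either return the Kryo instance to a shared pool or cache a
// single instance for reuse by the next borrow.
class KryoReleaseSketch(usePool: Boolean, returnToPool: Kryo => Unit) {
  private var cachedKryo: Kryo = _

  def releaseKryo(kryo: Kryo): Unit = {
    if (usePool) {
      returnToPool(kryo)        // pooled mode: hand the instance back
    } else if (cachedKryo == null) {
      cachedKryo = kryo         // non-pooled mode: keep one instance around
    }                           // otherwise simply drop the reference
  }
}
```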

releaseKryo is used when:

• KryoSerializationStream is requested to close
• KryoDeserializationStream is requested to close
• KryoSerializerInstance is requested to serialize and deserialize (and getAutoReset)

getAutoReset

getAutoReset(): Boolean

getAutoReset uses Java Reflection to access the value of the autoReset field of the Kryo class.
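
The reflective read can be sketched as follows (a hedged illustration; Kryo exposes setAutoReset but does not appear to expose a public getter for the field):

```scala
import com.esotericsoftware.kryo.Kryo

// Hedged sketch: read Kryo's private autoReset field via Java Reflection.
def getAutoResetSketch(kryo: Kryo): Boolean = {
  val field = classOf[Kryo].getDeclaredField("autoReset")
  field.setAccessible(true)                 // the field is not public
  field.get(kryo).asInstanceOf[Boolean]
}

// getAutoResetSketch(new Kryo()) is true unless setAutoReset(false) was called
// (e.g. by a custom registrator).
```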

getAutoReset is used when:

• KryoSerializer is requested for the supportsRelocationOfSerializedObjects flag

SerializationStream

SerializationStream is an abstraction of serialized streams for writing out serialized key-value records.

Contract

Closing Stream

close(): Unit

Flushing Stream

flush(): Unit

Used when:

• UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
• DiskBlockObjectWriter is requested to commitAndGet

Writing Out Object

writeObject[T: ClassTag](
  t: T): SerializationStream
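
A hedged usage sketch (assuming spark-core on the classpath; the stream is obtained with SerializerInstance.serializeStream): writeObject returns the stream itself, so calls can be chained.

```scala
import java.io.ByteArrayOutputStream
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hedged usage sketch: write two objects to an in-memory stream.
val out = new ByteArrayOutputStream()
val stream = new KryoSerializer(new SparkConf()).newInstance().serializeStream(out)
stream.writeObject("a record").writeObject(Seq(1, 2, 3))
stream.flush()   // Flushing Stream (above)
stream.close()   // Closing Stream (above)
```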

Used when:

• MemoryStore is requested to putIteratorAsBytes
• JavaSerializerInstance is requested to serialize
• RequestMessage is requested to serialize (for NettyRpcEnv)
• ParallelCollectionPartition is requested to writeObject (for ParallelCollectionRDD)
• ReliableRDDCheckpointData is requested to doCheckpoint
• TorrentBroadcast is created (and requested to writeBlocks)
• RangePartitioner is requested to writeObject
• SerializationStream is requested to writeKey, writeValue or writeAll
• FileSystemPersistenceEngine is requested to serializeIntoFile (for Spark Standalone's Master)

Implementations

• JavaSerializationStream
• KryoSerializationStream

Writing Out All Records

writeAll[T: ClassTag](
  iter: Iterator[T]): SerializationStream

writeAll writes out records of the given iterator (one by one as objects).
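
That behaviour can be sketched as a default method that drains the iterator through writeObject (a hedged sketch, not quoted source):

```scala
import scala.reflect.ClassTag

// Hedged sketch of a SerializationStream-like contract with a default writeAll.
trait SerializationStreamSketch {
  def writeObject[T: ClassTag](t: T): this.type

  def writeAll[T: ClassTag](iter: Iterator[T]): this.type = {
    while (iter.hasNext) {
      writeObject(iter.next())   // one record at a time, as a plain object
    }
    this                         // return the stream for chaining
  }
}
```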

writeAll is used when:

• ReliableCheckpointRDD is requested to doCheckpoint
• SerializerManager is requested to dataSerializeStream and dataSerializeWithExplicitClassTag

Writing Out Key

writeKey[T: ClassTag](
  key: T): SerializationStream

Writes out the key.

writeKey is used when:

• UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
• DiskBlockObjectWriter is requested to write the key and value of a record

Writing Out Value

writeValue[T: ClassTag](
  value: T): SerializationStream

Writes out the value.
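
A hedged sketch of how writeKey and writeValue plausibly relate to writeObject (the delegation is an assumption consistent with the descriptions above, not quoted source):

```scala
import scala.reflect.ClassTag

// Hedged sketch: key and value are written as two consecutive objects.
trait KeyValueSerializationStreamSketch {
  def writeObject[T: ClassTag](t: T): this.type

  def writeKey[T: ClassTag](key: T): this.type = writeObject(key)
  def writeValue[T: ClassTag](value: T): this.type = writeObject(value)
}
```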

writeValue is used when:

• UnsafeShuffleWriter is requested to insert a record into a ShuffleExternalSorter
• DiskBlockObjectWriter is requested to write the key and value of a record

Serializer

Serializer is an abstraction of serializers for serialization and deserialization of tasks (closures) and data blocks in a Spark application.

Contract

Creating New SerializerInstance

newInstance(): SerializerInstance

Creates a new SerializerInstance.

Used when:

• Task is created (only used in tests)
• SerializerSupport (Spark SQL) utility is used to newSerializer
• RangePartitioner is requested to writeObject and readObject
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TorrentBroadcast utility is used to blockifyObject and unBlockifyObject
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskRunner is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • NettyBlockRpcServer is requested to deserializeMetadata
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • NettyBlockTransferService is requested to uploadBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • PairRDDFunctions is requested to...FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ParallelCollectionPartition is requested to...FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • RDD is requested to...FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ReliableCheckpointRDD utility is used to...FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • NettyRpcEnvFactory is requested to create a RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DAGScheduler is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • others
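For example, a new SerializerInstance can be requested from the default Serializer registered in SparkEnv (a minimal sketch that assumes a running Spark application):

import org.apache.spark.SparkEnv

// The default data Serializer registered in SparkEnv (spark.serializer)
val serializer = SparkEnv.get.serializer

// A new SerializerInstance (to be used by a single thread at a time)
val instance = serializer.newInstance()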
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/Serializer/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JavaSerializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KryoSerializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeRowSerializer (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/Serializer/#accessing-serializer","title":"Accessing Serializer","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Serializer is available using SparkEnv as the closureSerializer and serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/Serializer/#closureserializer","title":"closureSerializer
SparkEnv.get.closureSerializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/Serializer/#serializer_1","title":"serializer
SparkEnv.get.serializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/Serializer/#serialized-objects-relocation-requirements","title":"Serialized Objects Relocation Requirements
supportsRelocationOfSerializedObjects: Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          supportsRelocationOfSerializedObjects is disabled (false) by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          supportsRelocationOfSerializedObjects is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockStoreShuffleReader is requested to fetchContinuousBlocksInBatch
• SortShuffleManager is requested to create a ShuffleHandle for a given ShuffleDependency (and checks the SerializedShuffleHandle requirements, as sketched below)
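The following is a simplified sketch (not the actual SortShuffleManager code) of how such a relocation check could look. Note that supportsRelocationOfSerializedObjects is private[spark], so code like this compiles only inside Spark's own packages:

import org.apache.spark.ShuffleDependency

// Simplified sketch only: a serialized (relocation-based) shuffle requires a
// serializer that supports relocation of serialized objects and no map-side
// combine (the real SortShuffleManager check has additional conditions).
def couldUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean =
  dependency.serializer.supportsRelocationOfSerializedObjects &&
    !dependency.mapSideCombine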
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/SerializerInstance/","title":"SerializerInstance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SerializerInstance is an abstraction of serializer instances (for use by one thread at a time).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/SerializerInstance/#contract","title":"Contract","text":""},{"location":"serializer/SerializerInstance/#deserializing-from-bytebuffer","title":"Deserializing (from ByteBuffer)
deserialize[T: ClassTag](
  bytes: ByteBuffer): T
deserialize[T: ClassTag](
  bytes: ByteBuffer,
  loader: ClassLoader): T

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskRunner is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ResultTask is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleMapTask is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TaskResultGetter is requested to enqueueFailedTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • others
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/SerializerInstance/#deserializing-from-inputstream","title":"Deserializing (from InputStream)
deserializeStream(
  s: InputStream): DeserializationStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/SerializerInstance/#serializing-to-bytebuffer","title":"Serializing (to ByteBuffer)
serialize[T: ClassTag](
  t: T): ByteBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/SerializerInstance/#serializing-to-outputstream","title":"Serializing (to OutputStream)
serializeStream(
  s: OutputStream): SerializationStream
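As a usage sketch (assuming a running Spark application), a SerializerInstance can round-trip a value through a ByteBuffer:

import java.nio.ByteBuffer
import org.apache.spark.SparkEnv

val instance = SparkEnv.get.serializer.newInstance()

// Serialize a value to a ByteBuffer and deserialize it back
val bytes: ByteBuffer = instance.serialize("hello")
val restored: String = instance.deserialize[String](bytes)
assert(restored == "hello")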
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"serializer/SerializerInstance/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JavaSerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KryoSerializerInstance
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeRowSerializerInstance (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/SerializerManager/","title":"SerializerManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SerializerManager is used to select the Serializer for shuffle blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"serializer/SerializerManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SerializerManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Default Serializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • (optional) Encryption Key (Option[Array[Byte]])

SerializerManager is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkEnv utility is used to create a SparkEnv (for the driver and executors)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"serializer/SerializerManager/#kryo-compatible-types","title":"Kryo-Compatible Types

Kryo-Compatible Types are the following primitive types, arrays of these primitive types, and Strings (see the selection sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Boolean
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Byte
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Char
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Double
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Float
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Int
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Long
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Null
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Short
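A sketch of how the type-based selection works (SerializerManager and its getSerializer are internal, private[spark] APIs, so this is illustrative only):

import scala.reflect.classTag
import org.apache.spark.SparkEnv

// Illustrative only (private[spark] API): for Kryo-compatible key and value
// types the KryoSerializer is returned, otherwise the default Serializer.
val serializerManager = SparkEnv.get.serializerManager
val shuffleSerializer =
  serializerManager.getSerializer(classTag[Int], classTag[String])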
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#default-serializer","title":"Default Serializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SerializerManager is given a Serializer when created (based on spark.serializer configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The Serializer is used when SerializerManager is requested for a Serializer.
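For example, the default Serializer can be switched to KryoSerializer using the spark.serializer configuration property (a minimal sketch; the application name is arbitrary):

import org.apache.spark.SparkConf

// Use KryoSerializer as the default (data) serializer
val conf = new SparkConf()
  .setAppName("serializer-demo") // hypothetical application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")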

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Tip

Enable the DEBUG logging level for SparkEnv to see which Serializer was selected.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Using serializer: [serializer]\n
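
Purely as a sketch of how the property above would be set (the KryoSerializer value is a common choice, not something mandated by this page), a Spark application could change the default Serializer like this:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: ask for KryoSerializer as the default Serializer
// via the spark.serializer configuration property.
val conf = new SparkConf()
  .setAppName("serializer-demo")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = SparkContext.getOrCreate(conf)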
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#accessing-serializermanager","title":"Accessing SerializerManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SerializerManager is available using SparkEnv on the driver and executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.SparkEnv\nSparkEnv.get.serializerManager\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#kryoserializer","title":"KryoSerializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SerializerManager creates a KryoSerializer when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            KryoSerializer is used as the serializer when the types of a given key and value are Kryo-compatible.
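
For illustration only (a standalone sketch, not SerializerManager's internal code), a KryoSerializer can be created from a SparkConf and used to round-trip a value:

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Sketch: create a KryoSerializer from a SparkConf and
// serialize/deserialize a String with a fresh SerializerInstance.
val kryo = new KryoSerializer(new SparkConf())
val instance = kryo.newInstance()

val bytes = instance.serialize("hello")            // java.nio.ByteBuffer
val restored = instance.deserialize[String](bytes)
assert(restored == "hello")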

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#selecting-serializer","title":"Selecting Serializer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getSerializer(\n  ct: ClassTag[_],\n  autoPick: Boolean): Serializer\ngetSerializer(\n  keyClassTag: ClassTag[_],\n  valueClassTag: ClassTag[_]): Serializer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getSerializer returns the KryoSerializer when the given ClassTags are Kryo-compatible and the autoPick flag is true. Otherwise, getSerializer returns the default Serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            autoPick flag is true for all BlockIds but Spark Streaming's StreamBlockIds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getSerializer (with autoPick flag) is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SerializerManager is requested to dataSerializeStream, dataSerializeWithExplicitClassTag and dataDeserializeStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SerializedValuesHolder (of MemoryStore) is requested for a SerializationStream

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getSerializer (with key and value ClassTags only) is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffledRDD is requested for dependencies
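
The following is a simplified, hypothetical sketch of that selection rule (the SerializerSelector class and its kryoCompatible primitive/String approximation are assumptions made for the sketch, not SerializerManager's actual code):

import scala.reflect.ClassTag
import org.apache.spark.serializer.{KryoSerializer, Serializer}

// Hypothetical selector mirroring the description above: Kryo is picked only
// when auto-picking is allowed and the ClassTag(s) are Kryo-compatible;
// otherwise the default Serializer is returned.
class SerializerSelector(defaultSerializer: Serializer, kryoSerializer: KryoSerializer) {

  // Assumption: "Kryo-compatible" is approximated here as primitives and Strings.
  private def kryoCompatible(ct: ClassTag[_]): Boolean =
    ct.runtimeClass.isPrimitive || ct.runtimeClass == classOf[String]

  def getSerializer(ct: ClassTag[_], autoPick: Boolean): Serializer =
    if (autoPick && kryoCompatible(ct)) kryoSerializer else defaultSerializer

  def getSerializer(keyClassTag: ClassTag[_], valueClassTag: ClassTag[_]): Serializer =
    if (kryoCompatible(keyClassTag) && kryoCompatible(valueClassTag)) kryoSerializer
    else defaultSerializer
}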
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#dataserializestream","title":"dataSerializeStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeStream[T: ClassTag](\n  blockId: BlockId,\n  outputStream: OutputStream,\n  values: Iterator[T]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeStream...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeStream\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to doPutIterator and dropFromMemory
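
The body above is still marked FIXME; as a hedged illustration of the underlying Serializer API only (JavaSerializer and the in-memory stream are arbitrary choices for the sketch), an Iterator can be written to an OutputStream as follows:

import java.io.ByteArrayOutputStream
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

// Sketch only: serialize an iterator of values into an output stream
// with Spark's SerializationStream API.
val serializer = new JavaSerializer(new SparkConf())
val out = new ByteArrayOutputStream()
serializer.newInstance().serializeStream(out).writeAll(Iterator(1, 2, 3)).close()
println(s"serialized ${out.size()} bytes")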
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#dataserializewithexplicitclasstag","title":"dataSerializeWithExplicitClassTag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeWithExplicitClassTag(\n  blockId: BlockId,\n  values: Iterator[_],\n  classTag: ClassTag[_]): ChunkedByteBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeWithExplicitClassTag...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataSerializeWithExplicitClassTag\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SerializerManager is requested to dataSerialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"serializer/SerializerManager/#datadeserializestream","title":"dataDeserializeStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataDeserializeStream[T](\n  blockId: BlockId,\n  inputStream: InputStream)\n  (classTag: ClassTag[T]): Iterator[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataDeserializeStream...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dataDeserializeStream\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockStoreUpdater is requested to saveDeserializedValuesToMemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to getLocalValues and getRemoteValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested to putIteratorAsBytes (when PartiallySerializedBlock is requested for a PartiallyUnrolledIterator)
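
Again only as a sketch of the Serializer API that such a method builds on (not its actual body, which is left as FIXME above), a stream can be deserialized back into an Iterator like this:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

// Round-trip sketch: write values out, then read them back lazily
// with DeserializationStream.asIterator.
val serializerInstance = new JavaSerializer(new SparkConf()).newInstance()

val out = new ByteArrayOutputStream()
serializerInstance.serializeStream(out).writeAll(Iterator("a", "b", "c")).close()

val in = new ByteArrayInputStream(out.toByteArray)
serializerInstance.deserializeStream(in).asIterator.foreach(println)  // a, b, c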
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/","title":"Shuffle System","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Shuffle System is a core service of Apache Spark that is responsible for shuffle blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The main core abstraction is ShuffleManager with SortShuffleManager as the default and only known implementation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark.shuffle.manager configuration property allows for a custom ShuffleManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Shuffle System uses shuffle handles, readers and writers.
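
As a hedged sketch of that configuration (org.example.MyShuffleManager is a made-up class name used only to show where the fully-qualified class name goes):

import org.apache.spark.SparkConf

// Sketch only: spark.shuffle.manager accepts either a built-in alias
// (e.g. "sort", the default) or a fully-qualified ShuffleManager class name.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.example.MyShuffleManager")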

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/#resources","title":"Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Improving Apache Spark Downscaling by Christopher Crosbie (Google) Ben Sidhom (Google)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spark shuffle introduction by Raymond Liu (aka colorant)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/BaseShuffleHandle/","title":"BaseShuffleHandle","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BaseShuffleHandle is a ShuffleHandle that is used to capture the parameters when SortShuffleManager is requested for a ShuffleHandle (and the other specialized ShuffleHandles could not be selected):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleDependency"},{"location":"shuffle/BaseShuffleHandle/#extensions","title":"Extensions","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BypassMergeSortShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SerializedShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/BaseShuffleHandle/#demo","title":"Demo","text":"

// Start a Spark application, e.g. spark-shell, with the Spark properties to trigger selection of BaseShuffleHandle:
// 1. spark.shuffle.spill.numElementsForceSpillThreshold=1
// 2. spark.shuffle.sort.bypassMergeThreshold=1

// numSlices > spark.shuffle.sort.bypassMergeThreshold
scala> val rdd = sc.parallelize(0 to 4, numSlices = 2).groupBy(_ % 2)
rdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:24

scala> rdd.dependencies
DEBUG SortShuffleManager: Can't use serialized shuffle for shuffle 0 because an aggregator is defined
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@1160c54b)

scala> rdd.getNumPartitions
res1: Int = 2

scala> import org.apache.spark.ShuffleDependency
import org.apache.spark.ShuffleDependency

scala> val shuffleDep = rdd.dependencies(0).asInstanceOf[ShuffleDependency[Int, Int, Int]]
shuffleDep: org.apache.spark.ShuffleDependency[Int,Int,Int] = org.apache.spark.ShuffleDependency@1160c54b

// mapSideCombine is disabled
scala> shuffleDep.mapSideCombine
res2: Boolean = false

// aggregator defined
scala> shuffleDep.aggregator
res3: Option[org.apache.spark.Aggregator[Int,Int,Int]] = Some(Aggregator(<function1>,<function2>,<function2>))

// the number of reduce partitions > spark.shuffle.sort.bypassMergeThreshold
scala> shuffleDep.partitioner.numPartitions
res4: Int = 2

scala> shuffleDep.shuffleHandle
res5: org.apache.spark.shuffle.ShuffleHandle = org.apache.spark.shuffle.BaseShuffleHandle@22b0fe7e
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/BlockStoreShuffleReader/","title":"BlockStoreShuffleReader","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockStoreShuffleReader is a ShuffleReader.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/BlockStoreShuffleReader/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockStoreShuffleReader takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BaseShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Blocks by Address (Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ShuffleReadMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • MapOutputTracker
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • shouldBatchFetch flag (default: false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockStoreShuffleReader is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SortShuffleManager is requested for a ShuffleReader (for a ShuffleHandle and a range of reduce partitions)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/BlockStoreShuffleReader/#reading-combined-records-for-reduce-task","title":"Reading Combined Records (for Reduce Task)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read(): Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read\u00a0is part of the ShuffleReader abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read creates a ShuffleBlockFetcherIterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/BlockStoreShuffleReader/#fetchcontinuousblocksinbatch","title":"fetchContinuousBlocksInBatch
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchContinuousBlocksInBatch: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchContinuousBlocksInBatch...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/BlockStoreShuffleReader/#review-me","title":"Review Me

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                === [[read]] Reading Combined Records For Reduce Task

Internally, read first storage:ShuffleBlockFetcherIterator.md#creating-instance[creates a ShuffleBlockFetcherIterator] (passing in the values of the relevant shuffle-related Spark configuration properties, e.g. spark.reducer.maxSizeInFlight).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: read uses scheduler:MapOutputTracker.md#getMapSizesByExecutorId[MapOutputTracker to find the BlockManagers with the shuffle blocks and sizes] to create ShuffleBlockFetcherIterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read creates a new serializer:SerializerInstance.md[SerializerInstance] (using Serializer from ShuffleDependency).

read creates a key/value iterator for every shuffle block stream (using deserializeStream of the SerializerInstance).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                read updates the context task metrics for each record read.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: read uses CompletionIterator (to count the records read) and spark-InterruptibleIterator.md[InterruptibleIterator] (to support task cancellation).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If the ShuffleDependency has an Aggregator defined, read wraps the current iterator inside an iterator defined by Aggregator.combineCombinersByKey (for mapSideCombine enabled) or Aggregator.combineValuesByKey otherwise.

NOTE: read reports an exception when the ShuffleDependency has no Aggregator defined but the mapSideCombine flag is enabled.

For a keyOrdering defined in the ShuffleDependency, read does the following (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. shuffle:ExternalSorter.md#creating-instance[Creates an ExternalSorter]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. shuffle:ExternalSorter.md#insertAll[Inserts all the records] into the ExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3. Updates context TaskMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                4. Returns a CompletionIterator for the ExternalSorter
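
The combine-and-sort branching above can be sketched in Scala as follows. This is illustrative only: combineThenSort is a made-up helper, records stands for the deserialized shuffle records, TaskContext.get stands in for the task's real context, and an in-memory sort replaces the ExternalSorter.

import org.apache.spark.{ShuffleDependency, TaskContext}\n\n// Illustrative sketch only -- not the actual BlockStoreShuffleReader code\ndef combineThenSort[K, V, C](\n    records: Iterator[(K, V)],  // deserialized records fetched from shuffle blocks\n    dep: ShuffleDependency[K, V, C]): Iterator[(K, _)] = {\n  // 1. Combine records when an Aggregator is defined\n  val combined: Iterator[(K, _)] = dep.aggregator match {\n    case Some(agg) if dep.mapSideCombine =>\n      // the records already carry combiners (C) produced on the map side\n      agg.combineCombinersByKey(records.asInstanceOf[Iterator[(K, C)]], TaskContext.get)\n    case Some(agg) =>\n      agg.combineValuesByKey(records, TaskContext.get)\n    case None =>\n      records\n  }\n  // 2. Sort by key when a key ordering is defined\n  // (the real code uses an ExternalSorter that can spill to disk)\n  dep.keyOrdering match {\n    case Some(ord) => combined.toSeq.sortBy(_._1)(ord).iterator\n    case None      => combined\n  }\n}\n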
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/BypassMergeSortShuffleHandle/","title":"BypassMergeSortShuffleHandle","text":"

BypassMergeSortShuffleHandle is a BaseShuffleHandle that SortShuffleManager uses when it can avoid merge-sorting data (when requested to register a shuffle).

BypassMergeSortShuffleHandle tells SortShuffleManager to use BypassMergeSortShuffleWriter when requested for a ShuffleWriter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/BypassMergeSortShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BypassMergeSortShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BypassMergeSortShuffleHandle is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SortShuffleManager is requested for a ShuffleHandle (for the ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BypassMergeSortShuffleHandle/#demo","title":"Demo","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  val rdd = sc.parallelize(0 to 8).groupBy(_ % 3)\n\nassert(rdd.dependencies.length == 1)\n\nimport org.apache.spark.ShuffleDependency\nval shuffleDep = rdd.dependencies.head.asInstanceOf[ShuffleDependency[Int, Int, Int]]\n\nassert(shuffleDep.mapSideCombine == false, \"mapSideCombine should be disabled\")\nassert(shuffleDep.aggregator.isDefined)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  // Use ':paste -raw' mode to paste the code\npackage org.apache.spark\nobject open {\n  import org.apache.spark.SparkContext\n  def bypassMergeThreshold(sc: SparkContext) = {\n    import org.apache.spark.internal.config.SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD\n    sc.getConf.get(SHUFFLE_SORT_BYPASS_MERGE_THRESHOLD)\n  }\n}\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  import org.apache.spark.open\nval bypassMergeThreshold = open.bypassMergeThreshold(sc)\n\nassert(shuffleDep.partitioner.numPartitions < bypassMergeThreshold)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  import org.apache.spark.shuffle.sort.BypassMergeSortShuffleHandle\n// BypassMergeSortShuffleHandle is private[spark]\n// so the following won't work :(\n// assert(shuffleDep.shuffleHandle.isInstanceOf[BypassMergeSortShuffleHandle[Int, Int]])\nassert(shuffleDep.shuffleHandle.toString.contains(\"BypassMergeSortShuffleHandle\"))\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BypassMergeSortShuffleWriter/","title":"BypassMergeSortShuffleWriter","text":"

BypassMergeSortShuffleWriter<K, V> is a ShuffleWriter for ShuffleMapTasks to write records into a single shuffle block data file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/BypassMergeSortShuffleWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BypassMergeSortShuffleWriter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BypassMergeSortShuffleHandle (of K keys and V values)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Map ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleWriteMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleExecutorComponents

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BypassMergeSortShuffleWriter is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleManager is requested for a ShuffleWriter (for a BypassMergeSortShuffleHandle)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/BypassMergeSortShuffleWriter/#diskblockobjectwriters","title":"DiskBlockObjectWriters
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockObjectWriter[] partitionWriters\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BypassMergeSortShuffleWriter uses a DiskBlockObjectWriter per partition (based on the Partitioner).

When requested to write out records to a shuffle file, BypassMergeSortShuffleWriter asserts that no partitionWriters have been created yet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    While writing, BypassMergeSortShuffleWriter requests the BlockManager for as many DiskBlockObjectWriters as there are partitions (in the Partitioner).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    While writing, BypassMergeSortShuffleWriter requests the Partitioner for a partition for records (using keys) and finds the per-partition DiskBlockObjectWriter that is requested to write out the partition records. After all records are written out to their shuffle files, the DiskBlockObjectWriters are requested to commitAndGet.

BypassMergeSortShuffleWriter uses the partition writers while writing out partition data and removes the references to them (sets the registry to null) in the end.

The partitionWriters internal registry becomes null after BypassMergeSortShuffleWriter has finished:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Writing out partition data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Stopping
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#indexshuffleblockresolver","title":"IndexShuffleBlockResolver

BypassMergeSortShuffleWriter is given an IndexShuffleBlockResolver when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BypassMergeSortShuffleWriter uses the IndexShuffleBlockResolver for writing out records (to writeIndexFileAndCommit and getDataFile).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#serializer","title":"Serializer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, BypassMergeSortShuffleWriter requests the ShuffleDependency (of the given BypassMergeSortShuffleHandle) for the Serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BypassMergeSortShuffleWriter creates a new instance of the Serializer for writing out records.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#configuration-properties","title":"Configuration Properties","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#sparkshufflefilebuffer","title":"spark.shuffle.file.buffer

BypassMergeSortShuffleWriter uses the spark.shuffle.file.buffer configuration property as the buffer size of the per-partition DiskBlockObjectWriters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#sparkfiletransferto","title":"spark.file.transferTo

BypassMergeSortShuffleWriter uses the spark.file.transferTo configuration property to control whether to use Java NIO's transferTo while copying the per-partition files into the single shuffle data file.
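
A minimal, illustrative snippet (the values are arbitrary) showing how these two writer-related properties could be set on a SparkConf:

import org.apache.spark.SparkConf\n\n// Illustrative only: a larger per-writer buffer and NIO transferTo disabled\nval conf = new SparkConf()\n  .set(\"spark.shuffle.file.buffer\", \"64k\")\n  .set(\"spark.file.transferTo\", \"false\")\n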

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#writing-out-records-to-shuffle-file","title":"Writing Out Records to Shuffle File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    void write(\n  Iterator<Product2<K, V>> records)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write creates a new instance of the Serializer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write initializes the partitionWriters and partitionWriterSegments internal registries (for DiskBlockObjectWriters and FileSegments for every partition, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the BlockManager for the DiskBlockManager and for every partition write requests it for a shuffle block ID and the file. write creates a DiskBlockObjectWriter for the shuffle block (using the BlockManager). write stores the reference to DiskBlockObjectWriters in the partitionWriters internal registry.
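
A rough Scala sketch of that setup step (the actual implementation is Java; numPartitions, blockManager, serializerInstance, fileBufferSize and writeMetrics are assumed to be in scope, and the Spark-internal APIs used here are only accessible from Spark's own packages):

// Illustrative sketch of creating the per-partition DiskBlockObjectWriters\nval partitionWriters = Array.tabulate(numPartitions) { _ =>\n  // a temporary shuffle block (and its file) per reduce partition\n  val (tempShuffleBlockId, tempFile) =\n    blockManager.diskBlockManager.createTempShuffleBlock()\n  blockManager.getDiskWriter(\n    tempShuffleBlockId, tempFile, serializerInstance, fileBufferSize, writeMetrics)\n}\n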

After all DiskBlockObjectWriters are created, write requests the ShuffleWriteMetrics to increment the shuffle write time metric.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For every record (a key-value pair), write requests the Partitioner for the partition ID for the key. The partition ID is then used as an index of the partition writer (among the DiskBlockObjectWriters) to write the current record out to a block file.
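
A rough Scala rendering of that per-record loop (the actual implementation is Java; records, partitionWriters and partitioner are assumed to be in scope):

// Illustrative: route every record to the writer of its reduce partition\nwhile (records.hasNext) {\n  val record = records.next()\n  val key = record._1\n  partitionWriters(partitioner.getPartition(key)).write(key, record._2)\n}\n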

Once all records have been written out to their respective block files, write does the following for every DiskBlockObjectWriter (sketched below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1. Requests the DiskBlockObjectWriter to commit and return a corresponding FileSegment of the shuffle block

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2. Saves the (reference to) FileSegments in the partitionWriterSegments internal registry

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    3. Requests the DiskBlockObjectWriter to close

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    At this point, all the records are in shuffle block files on a local disk. The records are split across block files by key.

write requests the IndexShuffleBlockResolver for the shuffle file for the shuffle and the map ID.

write creates a temporary file (based on the name of the shuffle file) and writes all the per-partition shuffle files to it. The size of every per-partition shuffle file is saved in the partitionLengths internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

At this point, all the per-partition shuffle block files have been merged into a single map shuffle data file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the IndexShuffleBlockResolver to write shuffle index and data files for the shuffle and the map IDs (with the partitionLengths and the temporary shuffle output file).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write returns a shuffle map output status (with the shuffle server ID and the partitionLengths).
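As a hedged sketch of what the partitionLengths end up encoding (not the actual IndexShuffleBlockResolver code), the shuffle index is essentially the running sum of the partition lengths, so reducer i's block is the byte range [offsets(i), offsets(i + 1)) within the single shuffle data file.

```scala
// Sketch only: cumulative offsets derived from per-partition lengths.
def indexOffsets(partitionLengths: Array[Long]): Array[Long] =
  partitionLengths.scanLeft(0L)(_ + _)

// indexOffsets(Array(10L, 0L, 25L)) == Array(0L, 10L, 10L, 35L)
```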

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#no-records","title":"No Records

When there are no records to write out, write initializes the partitionLengths internal array (of numPartitions size) with all elements being 0.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write requests the IndexShuffleBlockResolver to write shuffle index and data files, but the difference (compared to when there are records to write) is that the dataTmp argument is simply null.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    write sets the internal mapStatus (with the address of BlockManager in use and partitionLengths).
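A tiny illustration of the empty case (numPartitions is an assumed value, not from the source):

```scala
// With no records, every partition length stays 0 and no temporary data file
// is needed, hence dataTmp is null.
val numPartitions = 4
val partitionLengths: Array[Long] = new Array[Long](numPartitions)  // all zeros
```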

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#requirements","title":"Requirements

write requires that there are no DiskBlockObjectWriters yet, i.e. partitionWriters is null when write begins.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#writing-out-partitioned-data","title":"Writing Out Partitioned Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    long[] writePartitionedData(\n  ShuffleMapOutputWriter mapOutputWriter)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    writePartitionedData makes sure that DiskBlockObjectWriters are available (partitionWriters != null).

For every partition, writePartitionedData takes the partition file (from the FileSegments). Only when the partition file exists does writePartitionedData request the given ShuffleMapOutputWriter for a ShufflePartitionWriter and write out the partitioned data. At the end, writePartitionedData deletes the file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    writePartitionedData requests the ShuffleWriteMetricsReporter to increment the write time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, writePartitionedData requests the ShuffleMapOutputWriter to commitAllPartitions and returns the size of each partition of the output map file.
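The following is a simplified sketch of that copy-and-delete loop, in which a plain OutputStream and java.io.File stand in for the ShuffleMapOutputWriter, the ShufflePartitionWriters and the FileSegments; it illustrates the control flow only, not Spark's actual writePartitionedData.

```scala
import java.io.File
import java.nio.file.Files

// Sketch only: append each existing partition file to a single output,
// record its length, and delete the per-partition temporary file.
def concatenatePartitionFiles(
    partitionFiles: Array[File],
    output: File): Array[Long] = {
  val lengths = new Array[Long](partitionFiles.length)
  val out = Files.newOutputStream(output.toPath)
  try {
    partitionFiles.zipWithIndex.foreach { case (file, i) =>
      if (file != null && file.exists()) {
        lengths(i) = file.length()
        Files.copy(file.toPath, out)  // append this partition's bytes
        file.delete()                 // drop the per-partition temporary file
      }
    }
  } finally {
    out.close()
  }
  lengths  // the size of each partition in the single output file
}
```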

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#copying-raw-bytes-between-input-streams","title":"Copying Raw Bytes Between Input Streams
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    copyStream(\n  in: InputStream,\n  out: OutputStream,\n  closeStreams: Boolean = false,\n  transferToEnabled: Boolean = false): Long\n

copyStream branches off depending on the types of the in and out streams, i.e. whether they are both file-based streams (FileInputStream and FileOutputStream, respectively) and the transferToEnabled input flag is enabled.

If they are both file-based streams and transferToEnabled is enabled, copyStream gets their FileChannels and transfers bytes from the input file to the output file, counting the number of bytes (possibly zero) that were actually transferred.

NOTE: copyStream uses Java's java.nio.channels.FileChannel to manage file channels.

If either the in or out stream is not a file-based stream, or the transferToEnabled flag is disabled (default), copyStream reads data from in, writes it to out, and counts the number of bytes written.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    copyStream can optionally close in and out streams (depending on the input closeStreams -- disabled by default).
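A hedged sketch of that branching logic (not Spark's Utils.copyStream): use FileChannel.transferTo when both streams are file streams and transferToEnabled is set, otherwise fall back to a buffered copy.

```scala
import java.io.{FileInputStream, FileOutputStream, InputStream, OutputStream}

// Sketch only: copy in to out and return the number of bytes copied.
def copyStreamSketch(
    in: InputStream,
    out: OutputStream,
    closeStreams: Boolean = false,
    transferToEnabled: Boolean = false): Long = {
  try {
    (in, out) match {
      case (fin: FileInputStream, fout: FileOutputStream) if transferToEnabled =>
        val inChannel = fin.getChannel
        val outChannel = fout.getChannel
        val size = inChannel.size()
        var count = 0L
        while (count < size) {
          // the number of bytes actually transferred may be zero
          count += inChannel.transferTo(count, size - count, outChannel)
        }
        count
      case _ =>
        val buffer = new Array[Byte](8192)
        var count = 0L
        var n = in.read(buffer)
        while (n != -1) {
          out.write(buffer, 0, n)
          count += n
          n = in.read(buffer)
        }
        count
    }
  } finally {
    if (closeStreams) {
      in.close()
      out.close()
    }
  }
}
```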

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    NOTE: Utils.copyStream is used when <> (among other places).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Tip

Visit the official web site of JSR 51: New I/O APIs for the Java Platform and read up on the java.nio package.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#stopping-shufflewriter","title":"Stopping ShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Option<MapStatus> stop(\n  boolean success)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop...FIXME

stop is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#temporary-array-of-partition-lengths","title":"Temporary Array of Partition Lengths
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    long[] partitionLengths\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Temporary array of partition lengths after records are written to a shuffle system.

Initialized every time BypassMergeSortShuffleWriter writes out records (before being passed on to IndexShuffleBlockResolver). After IndexShuffleBlockResolver finishes, it is used to initialize the mapStatus internal property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter=ALL\n
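For Spark distributions that ship conf/log4j2.properties instead (Spark 3.3 and later), an equivalent configuration would look like the following; the logger id bypass is an arbitrary name chosen for this example:

```
logger.bypass.name = org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter
logger.bypass.level = all
```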

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#internal-properties","title":"Internal Properties","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#numpartitions","title":"numPartitions","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#partitionwritersegments","title":"partitionWriterSegments","text":""},{"location":"shuffle/BypassMergeSortShuffleWriter/#mapstatus","title":"mapStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    MapStatus that BypassMergeSortShuffleWriter returns when stopped

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Initialized every time BypassMergeSortShuffleWriter writes out records.

Used when BypassMergeSortShuffleWriter stops (with success enabled) as a marker of whether any records were written, and returned if they were.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/DownloadFileManager/","title":"DownloadFileManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DownloadFileManager is an abstraction of file managers that can createTempFile and registerTempFileToClean.
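A sketch of what such a file manager could look like, with java.io.File, TempFileManager and cleanAll as illustrative assumptions; Spark's contract uses DownloadFile and TransportConf rather than plain files.

```scala
import java.io.File
import java.util.UUID
import scala.collection.mutable

// Sketch only: hand out temporary files and remember which ones to clean up.
class TempFileManager(dir: File) {
  private val filesToClean = mutable.Set.empty[File]

  // counterpart of createTempFile(transportConf)
  def createTempFile(): File =
    new File(dir, s"shuffle-download-${UUID.randomUUID()}.tmp")

  // counterpart of registerTempFileToClean(file)
  def registerTempFileToClean(file: File): Boolean = synchronized {
    filesToClean.add(file)
  }

  def cleanAll(): Unit = synchronized {
    filesToClean.foreach(_.delete())
    filesToClean.clear()
  }
}
```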

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/DownloadFileManager/#contract","title":"Contract","text":""},{"location":"shuffle/DownloadFileManager/#createtempfile","title":"createTempFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DownloadFile createTempFile(\n  TransportConf transportConf)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DownloadCallback (of OneForOneBlockFetcher) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/DownloadFileManager/#registertempfiletoclean","title":"registerTempFileToClean
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    boolean registerTempFileToClean(\n  DownloadFile file)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DownloadCallback (of OneForOneBlockFetcher) is requested to onComplete
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/DownloadFileManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RemoteBlockDownloadFileManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleBlockFetcherIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ExecutorDiskUtils/","title":"ExecutorDiskUtils","text":""},{"location":"shuffle/ExternalAppendOnlyMap/","title":"ExternalAppendOnlyMap","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExternalAppendOnlyMap is a Spillable of SizeTrackers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExternalAppendOnlyMap[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.
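As a hedged illustration of the type parameters (not taken from Spark's sources), a per-key average could use K = String, V = Double and C = (Double, Int), i.e. a running (sum, count) combiner, with functions of the shape an ExternalAppendOnlyMap is created with:

```scala
// Sketch only: the three combiner functions for a per-key average.
object CombinerFunctions {
  type K = String
  type V = Double
  type C = (Double, Int)

  val createCombiner: V => C = v => (v, 1)
  val mergeValue: (C, V) => C = { case ((sum, n), v) => (sum + v, n + 1) }
  val mergeCombiners: (C, C) => C = { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
}
```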

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ExternalAppendOnlyMap/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ExternalAppendOnlyMap takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • [[createCombiner]] createCombiner function (V => C)
• [[mergeValue]] mergeValue function ((C, V) => C)
• [[mergeCombiners]] mergeCombiners function ((C, C) => C)
• [[serializer]] Optional serializer:Serializer.md[Serializer] (default: core:SparkEnv.md#serializer[system Serializer])
• [[blockManager]] Optional storage:BlockManager.md[BlockManager] (default: core:SparkEnv.md#blockManager[system BlockManager])
• [[context]] TaskContext
• [[serializerManager]] Optional serializer:SerializerManager.md[SerializerManager] (default: core:SparkEnv.md#serializerManager[system SerializerManager])
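
To make the three combiner functions concrete, here is an illustrative triple for V = Int and C = ArrayBuffer[Int] (the names and types below are made up for the example and are not taken from Spark's sources):

[source, scala]
----
import scala.collection.mutable.ArrayBuffer

// Illustrative combiner functions for V = Int and C = ArrayBuffer[Int].
val createCombiner: Int => ArrayBuffer[Int] =
  v => ArrayBuffer(v)                      // first value seen for a key starts a combiner
val mergeValue: (ArrayBuffer[Int], Int) => ArrayBuffer[Int] =
  (c, v) => c += v                         // later values of the same key are merged in
val mergeCombiners: (ArrayBuffer[Int], ArrayBuffer[Int]) => ArrayBuffer[Int] =
  (c1, c2) => c1 ++= c2                    // combiners from spilled maps are merged pairwise
----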

ExternalAppendOnlyMap is created when:

• Aggregator is requested to rdd:Aggregator.md#combineValuesByKey[combineValuesByKey] and rdd:Aggregator.md#combineCombinersByKey[combineCombinersByKey]

• CoGroupedRDD is requested to compute a partition

== [[currentMap]] SizeTrackingAppendOnlyMap

ExternalAppendOnlyMap manages a SizeTrackingAppendOnlyMap.

A SizeTrackingAppendOnlyMap is created immediately when ExternalAppendOnlyMap is, and again every time <> and <> spill to disk.

The SizeTrackingAppendOnlyMap is dereferenced (nulled) so the memory can be garbage-collected when <> and <>.

SizeTrackingAppendOnlyMap is used when <>, <>, <> and <>.
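
As a rough mental model of what "size tracking" buys (a conceptual toy, not Spark's SizeTrackingAppendOnlyMap; all names below are made up), such a map keeps a running size estimate next to the entries so its owner can decide when to spill:

[source, scala]
----
import scala.collection.mutable

// Conceptual illustration only; not Spark's SizeTrackingAppendOnlyMap.
class ToySizeTrackingMap[K, V] {
  private val backing = mutable.LinkedHashMap.empty[K, V]
  private var estimatedBytes = 0L

  // Update the value for a key with a function of the previous value (if any)
  // and bump a crude size estimate (real code samples actual object sizes).
  def changeValue(key: K, update: Option[V] => V): V = {
    val newValue = update(backing.get(key))
    backing.update(key, newValue)
    estimatedBytes += 64
    newValue
  }

  def estimateSize(): Long = estimatedBytes
  def iterator: Iterator[(K, V)] = backing.iterator
}
----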

== [[insertAll]] Inserting All Key-Value Pairs (from Iterator)

[source, scala]
----
insertAll(
  entries: Iterator[Product2[K, V]]): Unit
----

[[insertAll-update-function]] insertAll creates an update function that uses the <<mergeValue, mergeValue>> function for an existing value or the <<createCombiner, createCombiner>> function for a new value.

For every key-value pair (from the input iterator), insertAll does the following:

• Requests the <<currentMap, SizeTrackingAppendOnlyMap>> for the estimated size and, if greater than the <<_peakMemoryUsedBytes, _peakMemoryUsedBytes>> metric, updates it.

• shuffle:Spillable.md#maybeSpill[Spills to disk if necessary] and, if spilled, creates a new <<currentMap, SizeTrackingAppendOnlyMap>>

• Requests the <<currentMap, SizeTrackingAppendOnlyMap>> to change the value for the current key (with the <<insertAll-update-function, update>> function)

• shuffle:Spillable.md#addElementsRead[Increments the elements read counter]
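
The update function can be sketched generically as follows (an illustration of the shape described above, not Spark's actual code; updateFor is a made-up name):

[source, scala]
----
// Illustrative only: an insertAll-style update function built from
// createCombiner and mergeValue for the value currently being inserted.
def updateFor[V, C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    incoming: V): (Boolean, C) => C =
  (hadValue, oldCombiner) =>
    if (hadValue) mergeValue(oldCombiner, incoming)  // key already present: merge
    else createCombiner(incoming)                    // new key: create a combiner
----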

=== [[insertAll-usage]] Usage

insertAll is used when:

• Aggregator is requested to rdd:Aggregator.md#combineValuesByKey[combineValuesByKey] and rdd:Aggregator.md#combineCombinersByKey[combineCombinersByKey]

• CoGroupedRDD is requested to compute a partition

• ExternalAppendOnlyMap is requested to <>

=== [[insertAll-requirements]] Requirements

insertAll throws an IllegalStateException when the <<currentMap, currentMap>> internal registry is null:

[source,plaintext]
----
Cannot insert new elements into a map after calling iterator
----
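
A minimal sketch of that guard (illustrative only; requireMapAvailable is a made-up helper, not Spark's code):

[source, scala]
----
// Illustrative guard: once the in-memory map has been handed out as an
// iterator (and nulled), further inserts are rejected.
def requireMapAvailable(currentMap: AnyRef): Unit =
  if (currentMap == null)
    throw new IllegalStateException(
      "Cannot insert new elements into a map after calling iterator")
----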

== [[iterator]] Iterator of "Combined" Pairs

[source, scala]
----
iterator: Iterator[(K, C)]
----

iterator...FIXME

iterator is used when...FIXME

== [[spill]] Spilling to Disk if Necessary

[source, scala]
----
spill(
  collection: SizeTracker): Unit
----

spill...FIXME

spill is used when...FIXME

== [[forceSpill]] Forcing Disk Spilling

[source, scala]
----
forceSpill(): Boolean
----

forceSpill returns a flag that indicates whether spilling to disk has actually happened (true) or not (false).

forceSpill branches off based on the current state it is in (and should rather use a state-aware implementation).

When a SpillableIterator is in use, forceSpill requests it to spill and, if it did spill, dereferences (nullifies) the <<currentMap, SizeTrackingAppendOnlyMap>>. forceSpill returns whatever the spilling of the SpillableIterator returned.

When there is at least one element in the <<currentMap, SizeTrackingAppendOnlyMap>>, forceSpill <<spill, spills>> it. forceSpill then creates a new SizeTrackingAppendOnlyMap and always returns true.

In other cases, forceSpill simply returns false.

forceSpill is part of the shuffle:Spillable.md[Spillable] abstraction.
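
The branching above can be mirrored in a self-contained toy (all class, field, and method names below are made up for illustration; this is not Spark's ExternalAppendOnlyMap):

[source, scala]
----
import scala.collection.mutable

// Toy stand-ins; all names here are illustrative, not Spark's classes.
class ToySpillableIterator {
  def spill(): Boolean = true              // pretend spilling always succeeds
}

class ToyExternalMap {
  private var currentMap: mutable.Map[String, Int] = mutable.Map.empty
  private var readingIterator: ToySpillableIterator = null

  private def spillToDisk(map: mutable.Map[String, Int]): Unit = ()  // stub

  def forceSpill(): Boolean =
    if (readingIterator != null) {         // an iterator over the map is in use
      val spilled = readingIterator.spill()
      if (spilled) currentMap = null       // drop the map so it can be GC'd
      spilled
    } else if (currentMap.nonEmpty) {      // spill the in-memory map itself
      spillToDisk(currentMap)
      currentMap = mutable.Map.empty       // start a fresh map
      true
    } else {
      false                                // nothing to spill
    }
}
----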

== [[freeCurrentMap]] Freeing Up SizeTrackingAppendOnlyMap and Releasing Memory

[source, scala]
----
freeCurrentMap(): Unit
----

freeCurrentMap dereferences (nullifies) the <<currentMap, SizeTrackingAppendOnlyMap>> (if there still was one), followed by shuffle:Spillable.md#releaseMemory[releasing all memory].

freeCurrentMap is used when SpillableIterator is requested to destroy itself.

== [[spillMemoryIteratorToDisk]] spillMemoryIteratorToDisk Method

[source, scala]
----
spillMemoryIteratorToDisk(
  inMemoryIterator: Iterator[(K, C)]): DiskMapIterator
----

spillMemoryIteratorToDisk...FIXME

spillMemoryIteratorToDisk is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ExternalSorter/","title":"ExternalSorter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExternalSorter is a Spillable of WritablePartitionedPairCollection of pairs (of K keys and C values).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExternalSorter[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExternalSorter is used for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SortShuffleWriter to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockStoreShuffleReader to read records (with a key ordering defined)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ExternalSorter/#creating-instance","title":"Creating Instance","text":"

ExternalSorter takes the following to be created:

• TaskContext
• Optional Aggregator (default: undefined)
• Optional Partitioner (default: undefined)
• Optional Ordering (Scala) for keys (default: undefined)
• Serializer (default: Serializer)

ExternalSorter is created when:

• BlockStoreShuffleReader is requested to read records (for a reduce task)
• SortShuffleWriter is requested to write records (as an ExternalSorter[K, V, C] or ExternalSorter[K, V, V] based on the Map-Side Partial Aggregation Flag)
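
For orientation, a hedged user-level sketch that, under default settings, typically exercises ExternalSorter on both sides of a shuffle: reduceByKey enables map-side aggregation (so SortShuffleWriter sorts with an ExternalSorter when writing), and sortByKey defines a key ordering (so BlockStoreShuffleReader sorts with an ExternalSorter when reading). Local-mode setup and spark-core on the classpath are assumed.

[source, scala]
----
import org.apache.spark.{SparkConf, SparkContext}

// Local Spark context just for the illustration.
val sc = new SparkContext(
  new SparkConf().setAppName("external-sorter-demo").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 3)))

// Map-side aggregation => SortShuffleWriter backed by an ExternalSorter on write.
val reduced = pairs.reduceByKey(_ + _)

// Key ordering on the shuffle => ExternalSorter used by BlockStoreShuffleReader on read.
val sorted = reduced.sortByKey()

println(sorted.collect().toSeq)
sc.stop()
----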
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ExternalSorter/#inserting-records","title":"Inserting Records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          insertAll(\n  records: Iterator[Product2[K, V]]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          insertAll branches off per whether the optional Aggregator was specified or not (when creating the ExternalSorter).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          insertAll takes all records eagerly and materializes the given records iterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#map-side-aggregator-specified","title":"Map-Side Aggregator Specified

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With an Aggregator given, insertAll creates an update function based on the mergeValue and createCombiner functions of the Aggregator.

For every record, insertAll increments the internal read counter.

insertAll requests the PartitionedAppendOnlyMap to changeValue for the composite key (the partition of the current record's key and the key itself, i.e. (partition, key)) with the update function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, insertAll spills the in-memory collection to disk if needed with the usingMap flag enabled (to indicate that the PartitionedAppendOnlyMap was updated).
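
The following is a minimal, self-contained sketch of this map-side aggregation path, not Spark's implementation: a plain mutable.Map keyed by (partition, key) stands in for the PartitionedAppendOnlyMap, a hash of the key stands in for the real Partitioner, and the hypothetical insertAllWithAggregator takes the Aggregator's createCombiner and mergeValue functions directly.

```scala
import scala.collection.mutable

// Sketch only: mimics insertAll's aggregator path with a plain mutable.Map
// instead of Spark's PartitionedAppendOnlyMap.
def insertAllWithAggregator[K, V, C](
    records: Iterator[Product2[K, V]],
    numPartitions: Int,
    createCombiner: V => C,
    mergeValue: (C, V) => C): mutable.Map[(Int, K), C] = {
  val map = mutable.Map.empty[(Int, K), C]
  records.foreach { record =>
    // stand-in for the Partitioner: hash the key into a partition id
    val partition = math.abs(record._1.hashCode % numPartitions)
    val key = (partition, record._1)
    // the "update" function: merge into an existing combiner or create a new one
    val updated = map.get(key) match {
      case Some(combiner) => mergeValue(combiner, record._2)
      case None           => createCombiner(record._2)
    }
    map.update(key, updated)
    // a real ExternalSorter would call maybeSpillCollection(usingMap = true) here
  }
  map
}
```

For example, with createCombiner = (v: Int) => v and mergeValue = (c: Int, v: Int) => c + v, the sketch produces per-key partial sums within each partition, which is what map-side aggregation is for.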

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#no-map-side-aggregator-specified","title":"No Map-Side Aggregator Specified

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With no Aggregator given, insertAll iterates over all the records and uses the PartitionedPairBuffer instead.

For every record, insertAll increments the internal read counter.

insertAll requests the PartitionedPairBuffer to insert the partition of the current record's key, the key itself, and the value of the record.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, insertAll spills the in-memory collection to disk if needed with the usingMap flag disabled (since this time the PartitionedPairBuffer was updated, not the PartitionedAppendOnlyMap).
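
For comparison, a sketch of the no-aggregator path under the same assumptions (plain Scala collections, hash-based partitioning, a hypothetical insertAllWithoutAggregator helper) simply appends ((partition, key), value) pairs, mirroring the PartitionedPairBuffer:

```scala
import scala.collection.mutable

// Sketch only: mimics insertAll's buffer path with an ArrayBuffer
// instead of Spark's PartitionedPairBuffer.
def insertAllWithoutAggregator[K, V](
    records: Iterator[Product2[K, V]],
    numPartitions: Int): mutable.ArrayBuffer[((Int, K), V)] = {
  val buffer = mutable.ArrayBuffer.empty[((Int, K), V)]
  records.foreach { record =>
    val partition = math.abs(record._1.hashCode % numPartitions)
    val entry = ((partition, record._1), record._2)
    buffer += entry
    // a real ExternalSorter would call maybeSpillCollection(usingMap = false) here
  }
  buffer
}
```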

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#spilling-in-memory-collection-to-disk","title":"Spilling In-Memory Collection to Disk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maybeSpillCollection(\n  usingMap: Boolean): Unit\n

maybeSpillCollection branches based on the input usingMap flag (which indicates the in-memory collection in use, the PartitionedAppendOnlyMap or the PartitionedPairBuffer).

maybeSpillCollection requests the collection to estimate its size (in bytes), which is tracked as the peakMemoryUsedBytes metric (whenever the estimate exceeds the value currently recorded).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          maybeSpillCollection spills the collection to disk if needed. If spilled, maybeSpillCollection creates a new collection (a new PartitionedAppendOnlyMap or a new PartitionedPairBuffer).
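
A self-contained sketch of this control flow follows, assuming a single buffer, a crude size estimate, and a fixed byte threshold instead of Spark's SizeTracker/Spillable machinery; the class and method names are illustrative only.

```scala
import scala.collection.mutable

// Sketch only: track the estimated size, record the peak, and replace the
// collection with a fresh one once a threshold is crossed.
class SpillingBufferSketch[K, V](spillThresholdBytes: Long) {
  private var buffer = mutable.ArrayBuffer.empty[(K, V)]
  private val spills = mutable.ArrayBuffer.empty[Seq[(K, V)]]  // stand-in for SpilledFiles
  var peakMemoryUsedBytes = 0L

  // crude per-record estimate; ExternalSorter relies on SizeTracker/SizeEstimator
  private def estimateSize(): Long = buffer.length * 64L

  def insert(key: K, value: V): Unit = {
    buffer += ((key, value))
    maybeSpillCollection()
  }

  private def maybeSpillCollection(): Unit = {
    val estimated = estimateSize()
    if (estimated > peakMemoryUsedBytes) peakMemoryUsedBytes = estimated  // track the peak
    if (estimated >= spillThresholdBytes) {
      spills += buffer.toSeq                       // "spill": here, just move the data aside
      buffer = mutable.ArrayBuffer.empty[(K, V)]   // if spilled, start a new collection
    }
  }
}
```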

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#usage","title":"Usage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          insertAll is used when:

• SortShuffleWriter is requested to write records (as an ExternalSorter[K, V, C] or ExternalSorter[K, V, V] based on the Map-Side Partial Aggregation Flag)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockStoreShuffleReader is requested to read records (with a key sorting defined)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#in-memory-collections-of-records","title":"In-Memory Collections of Records

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExternalSorter uses PartitionedPairBuffers or PartitionedAppendOnlyMaps to store records in memory before spilling to disk.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExternalSorter uses PartitionedPairBuffers when created with no Aggregator. Otherwise, ExternalSorter uses PartitionedAppendOnlyMaps.
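
As a tiny illustration only (plain Scala collections standing in for Spark's partitioned collections), the choice could be expressed as:

```scala
import scala.collection.mutable

// Illustration only: an Aggregator selects a map-like collection (so values can
// be combined), no Aggregator selects an append-only buffer.
def newInMemoryCollection[K, V, C](
    hasAggregator: Boolean
): Either[mutable.Map[(Int, K), C], mutable.ArrayBuffer[((Int, K), V)]] =
  if (hasAggregator) Left(mutable.Map.empty[(Int, K), C])   // ~ PartitionedAppendOnlyMap
  else Right(mutable.ArrayBuffer.empty[((Int, K), V)])      // ~ PartitionedPairBuffer
```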

ExternalSorter inserts records into the collections in insertAll.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExternalSorter spills the in-memory collection to disk if needed and, if so, creates a new collection.

ExternalSorter releases the collections (nulls them) when requested to forceSpill and stop, so that the JVM garbage collector can eventually evict them from memory completely.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#peak-size-of-in-memory-collection","title":"Peak Size of In-Memory Collection

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExternalSorter tracks the peak size (in bytes) of the in-memory collection whenever requested to spill the in-memory collection to disk if needed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The peak size is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockStoreShuffleReader is requested to read combined records for a reduce task (with an ordering defined)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ExternalSorter is requested to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#spills","title":"Spills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          spills: ArrayBuffer[SpilledFile]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ExternalSorter creates the spills internal buffer of SpilledFiles when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          A new SpilledFile is added when ExternalSorter is requested to spill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

An empty spills buffer indicates that there is only in-memory data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SpilledFiles are deleted physically from disk and the spills buffer is cleared when ExternalSorter is requested to stop.

ExternalSorter uses the spills buffer when requested for a partitionedIterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#number-of-spills","title":"Number of Spills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          numSpills: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          numSpills is the number of spill files this sorter has spilled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ExternalSorter/#spilledfile","title":"SpilledFile

SpilledFile is the metadata of a spilled file:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • File (Java)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Serializer Batch Sizes (Array[Long])
• Elements per Partition (Array[Long])
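
A sketch of that metadata as a Scala case class follows; the field names are illustrative (not Spark's exact ones) and a String stands in for the BlockId.

```scala
import java.io.File

// Sketch only: the shape of the SpilledFile metadata listed above.
final case class SpilledFileSketch(
  file: File,                         // the spill file on disk
  blockId: String,                    // stand-in for the BlockId the file is registered under
  serializerBatchSizes: Array[Long],  // size (in bytes) of each serialized batch in the file
  elementsPerPartition: Array[Long])  // number of records written per partition
```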

## Spilling Data to Disk

```scala
spill(
  collection: WritablePartitionedPairCollection[K, C]): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spill is part of the Spillable abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spill requests the given WritablePartitionedPairCollection for a destructive WritablePartitionedIterator.

spill spillMemoryIteratorToDisk (with the destructive WritablePartitionedIterator), which creates a SpilledFile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, spill adds the SpilledFile to the spills internal registry.
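
A minimal sketch of this flow, with a throw-away text format and a plain java.io.File standing in for spillMemoryIteratorToDisk and SpilledFile (the spillSketch name is hypothetical):

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable

// Sketch only: drain a (destructive) iterator over the in-memory data to a
// temporary file and add it to the spills registry.
def spillSketch[K, C](
    inMemoryIterator: Iterator[((Int, K), C)],
    spills: mutable.ArrayBuffer[File]): Unit = {
  val file = File.createTempFile("spill", ".data")   // ~ what spillMemoryIteratorToDisk produces
  val out = new PrintWriter(file)
  try {
    inMemoryIterator.foreach { case ((partition, key), value) =>
      out.println(s"$partition\t$key\t$value")        // toy text format, not Spark's serializer
    }
  } finally out.close()
  spills += file                                      // the new spill joins the spills registry
}
```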

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#spillmemoryiteratortodisk","title":"spillMemoryIteratorToDisk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk(\n  inMemoryIterator: WritablePartitionedIterator): SpilledFile\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spillMemoryIteratorToDisk is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExternalSorter is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SpillableIterator is requested to spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#partitionediterator","title":"partitionedIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            partitionedIterator...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            partitionedIterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExternalSorter is requested for an iterator and to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#writepartitionedmapoutput","title":"writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            writePartitionedMapOutput(\n  shuffleId: Int,\n  mapId: Long,\n  mapOutputWriter: ShuffleMapOutputWriter): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            writePartitionedMapOutput...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            writePartitionedMapOutput is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#iterator","title":"Iterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            iterator: Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            iterator turns the isShuffleSort flag off (false).

iterator requests the partitionedIterator and takes only the combined values (the second elements).
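
A sketch of that "take the combined values only" step: flatten the partitioned iterator and drop the partition ids (the combinedValues name is hypothetical; types follow the signatures above).

```scala
// Sketch only: from (partition, records) pairs to a single iterator of records.
def combinedValues[K, C](
    partitioned: Iterator[(Int, Iterator[Product2[K, C]])]): Iterator[Product2[K, C]] =
  partitioned.flatMap { case (_, records) => records }
```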

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            iterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockStoreShuffleReader is requested to read combined records for a reduce task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#stopping-externalsorter","title":"Stopping ExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockStoreShuffleReader is requested to read records (with ordering defined)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SortShuffleWriter is requested to stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/ExternalSorter/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.util.collection.ExternalSorter logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.util.collection.ExternalSorter=ALL

Refer to Logging.

FetchFailedException

A FetchFailedException may be thrown when a task runs (and ShuffleBlockFetcherIterator could not fetch shuffle blocks).

When FetchFailedException is reported, TaskRunner catches it and notifies the ExecutorBackend (with TaskState.FAILED task state).

Creating Instance

FetchFailedException takes the following to be created:

• BlockManagerId
• Shuffle ID
• Map ID
• Map Index
• Reduce ID
• Error Message
• Error Cause

While being created, FetchFailedException requests the current TaskContext to setFetchFailed.

FetchFailedException is created when:

• ShuffleBlockFetcherIterator is requested to throw a FetchFailedException (for a ShuffleBlockId or a ShuffleBlockBatchId)

Error Cause

FetchFailedException can be given an error cause when created.

The root cause of a FetchFailedException is usually that the executor (with the BlockManager hosting the requested shuffle blocks) has been lost and is no longer available, due to one of the following:

1. An OutOfMemoryError (aka OOM) or some other unhandled exception was thrown
2. The cluster manager that manages the workers with the executors of your Spark application (e.g. Kubernetes, Hadoop YARN) enforces the container memory limits and eventually decides to kill the executor due to excessive memory usage

A solution is usually to tune the memory of your Spark application.
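For example, giving executors more heap and more off-heap overhead is a common first step. The configuration keys below are real Spark properties, but the values are purely illustrative and should be adapted to your workload:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative memory tuning; adjust the values to your workload and cluster.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-app")
  .set("spark.executor.memory", "8g")          // executor JVM heap
  .set("spark.executor.memoryOverhead", "2g")  // off-heap headroom (YARN / Kubernetes)

val sc = SparkContext.getOrCreate(conf)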

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/FetchFailedException/#taskcontext","title":"TaskContext

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              TaskContext comes with setFetchFailed and fetchFailed to hold a FetchFailedException unmodified (regardless of what happens in a user code).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/","title":"IndexShuffleBlockResolver","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              IndexShuffleBlockResolver is a ShuffleBlockResolver that manages shuffle block data and uses shuffle index files for faster shuffle data access.
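For a map output with N reduce partitions, the index file stores N + 1 byte offsets (longs) into the data file, so locating one reduce partition's bytes takes only two small reads. A self-contained Scala sketch of that lookup follows; the helper name is hypothetical and not part of Spark's API:

import java.io.{DataInputStream, File, FileInputStream}

// Hypothetical helper (not Spark's API): read the byte range of one reduce
// partition from a shuffle index file that stores (numPartitions + 1) offsets.
def partitionByteRange(indexFile: File, reduceId: Int): (Long, Long) = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skipBytes(reduceId * 8)  // each offset is a Long (8 bytes)
    val start = in.readLong()
    val end = in.readLong()
    (start, end)                // the partition's bytes span [start, end) in the data file
  } finally in.close()
}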

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/IndexShuffleBlockResolver/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              IndexShuffleBlockResolver takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                IndexShuffleBlockResolver is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SortShuffleManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • LocalDiskShuffleExecutorComponents is requested to initializeExecutor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/IndexShuffleBlockResolver/#getstoredshuffles","title":"getStoredShuffles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getStoredShuffles(): Seq[ShuffleBlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getStoredShuffles\u00a0is part of the MigratableResolver abstraction.

getStoredShuffles...FIXME

putShuffleBlockAsStream

putShuffleBlockAsStream(
  blockId: BlockId,
  serializerManager: SerializerManager): StreamCallbackWithID

putShuffleBlockAsStream is part of the MigratableResolver abstraction.

putShuffleBlockAsStream...FIXME

getMigrationBlocks

getMigrationBlocks(
  shuffleBlockInfo: ShuffleBlockInfo): List[(BlockId, ManagedBuffer)]

getMigrationBlocks is part of the MigratableResolver abstraction.

getMigrationBlocks...FIXME

Writing Shuffle Index and Data Files

writeIndexFileAndCommit(
  shuffleId: Int,
  mapId: Long,
  lengths: Array[Long],
  dataTmp: File): Unit

writeIndexFileAndCommit finds the index and data files for the input shuffleId and mapId.

writeIndexFileAndCommit creates a temporary file for the index file (in the same directory) and writes the offsets into it: a running total of the input lengths, starting from 0, with the final offset marking the end of the output data file.

Note

The offsets are simply the cumulative sums of the input lengths.
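A tiny, self-contained illustration of that computation (the sample lengths are made up):

// Offsets are the running totals of the partition lengths, starting at 0.
val lengths = Array(10L, 0L, 25L, 5L)
val offsets = lengths.scanLeft(0L)(_ + _)  // 0, 10, 10, 35, 40

// offsets.last (40) is the total size of the data file; reduce partition i
// occupies the byte range [offsets(i), offsets(i + 1)) in the data file.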

writeIndexFileAndCommit...FIXME (Review me)

writeIndexFileAndCommit gets the shuffle index and data files for the input shuffleId and mapId.

writeIndexFileAndCommit checks whether the index and data files match each other (aka consistency check).

If the consistency check fails, it means that another attempt for the same task has already written the map outputs successfully and so the input dataTmp and temporary index files are deleted (as no longer correct).

If the consistency check succeeds, the existing index and data files are deleted (if they exist) and the temporary index and data files become "official", i.e. renamed to their final names.
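The "become official" step comes down to a rename-or-fail on each temporary file, roughly along the lines of the following sketch (a hypothetical helper, not Spark's code; corner cases such as a missing data file are ignored):

import java.io.{File, IOException}

// Hypothetical helper mirroring the rename-or-fail behaviour described above.
def commitTempFile(tmp: File, target: File): Unit = {
  if (target.exists()) target.delete()        // drop a stale final file, if any
  if (tmp.exists() && !tmp.renameTo(target))
    throw new IOException(s"fail to rename file $tmp to $target")
}

// commitTempFile(indexTmp, indexFile)
// commitTempFile(dataTmp, dataFile)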

In case of any IO-related exception, writeIndexFileAndCommit throws an IOException with one of the following messages:

fail to rename file [indexTmp] to [indexFile]

or

fail to rename file [dataTmp] to [dataFile]

writeIndexFileAndCommit is used when:

• LocalDiskShuffleMapOutputWriter is requested to commitAllPartitions
• LocalDiskSingleSpillMapOutputWriter is requested to transferMapSpillFile

Removing Shuffle Index and Data Files
removeDataByMap(\n  shuffleId: Int,\n  mapId: Long): Unit\n

removeDataByMap finds and deletes the shuffle data file (for the given shuffleId and mapId) and then finds and deletes the corresponding shuffle index file.

removeDataByMap is used when:

• SortShuffleManager is requested to unregister a shuffle (and remove a shuffle from the shuffle system)
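Conceptually, removeDataByMap just resolves the two per-map files (via getDataFile and getIndexFile) and deletes whichever of them exist. A minimal sketch, assuming it runs inside IndexShuffleBlockResolver where those methods are in scope; the real implementation's logging and error handling are omitted:

// Sketch only: delete the data file first, then the index file.
def removeDataByMap(shuffleId: Int, mapId: Long): Unit = {
  Seq(getDataFile(shuffleId, mapId), getIndexFile(shuffleId, mapId)).foreach { file =>
    if (file.exists() && !file.delete()) {
      // the real implementation logs a warning when a file cannot be deleted
    }
  }
}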
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-shuffle-block-index-file","title":"Creating Shuffle Block Index File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getIndexFile(\n  shuffleId: Int,\n  mapId: Long,\n  dirs: Option[Array[String]] = None): File\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getIndexFile creates a ShuffleIndexBlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With dirs local directories defined, getIndexFile places the index file of the ShuffleIndexBlockId (by the name) in the local directories (with the spark.diskStore.subDirectories).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, with no local directories, getIndexFile requests the DiskBlockManager (of the BlockManager) to get the data file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getIndexFile\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • IndexShuffleBlockResolver is requested to getBlockData, removeDataByMap, putShuffleBlockAsStream, getMigrationBlocks, writeIndexFileAndCommit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • FallbackStorage is requested to copy
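A standalone sketch of the two resolution branches described above. The fallback function and the subDirsPerLocalDir parameter are assumptions standing in for DiskBlockManager.getFile and spark.diskStore.subDirectories; the directory hashing mirrors how DiskBlockManager picks a root directory and a subdirectory by block name:

import java.io.File
import org.apache.spark.storage.{BlockId, ShuffleIndexBlockId}

// Sketch only: resolve the shuffle index file either in the supplied dirs or via the fallback.
def resolveIndexFile(
    shuffleId: Int,
    mapId: Long,
    dirs: Option[Array[String]],
    subDirsPerLocalDir: Int,
    fallback: BlockId => File): File = {
  val blockId = ShuffleIndexBlockId(shuffleId, mapId, 0)  // 0 is the placeholder (no-op) reduce id
  dirs match {
    case Some(localDirs) =>
      // hash the block name to select a root directory and a subdirectory
      val hash = math.abs(blockId.name.hashCode)
      val rootDir = localDirs(hash % localDirs.length)
      val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
      new File(new File(rootDir, "%02x".format(subDirId)), blockId.name)
    case None =>
      fallback(blockId)  // e.g. DiskBlockManager.getFile
  }
}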
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-shuffle-block-data-file","title":"Creating Shuffle Block Data File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getDataFile(\n  shuffleId: Int,\n  mapId: Long): File // (1)\ngetDataFile(\n  shuffleId: Int,\n  mapId: Long,\n  dirs: Option[Array[String]]): File\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. dirs is None (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getDataFile creates a ShuffleDataBlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With dirs local directories defined, getDataFile places the data file of the ShuffleDataBlockId (by the name) in the local directories (with the spark.diskStore.subDirectories).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, with no local directories, getDataFile requests the DiskBlockManager (of the BlockManager) to get the data file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getDataFile\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • IndexShuffleBlockResolver is requested to getBlockData, removeDataByMap, putShuffleBlockAsStream, getMigrationBlocks, writeIndexFileAndCommit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • LocalDiskShuffleMapOutputWriter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • LocalDiskSingleSpillMapOutputWriter is requested to transferMapSpillFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • FallbackStorage is requested to copy
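The two variants differ only in whether custom local directories are supplied; the file resolution itself mirrors the getIndexFile sketch above (with a ShuffleDataBlockId instead of a ShuffleIndexBlockId). A minimal sketch of the delegation:

// Sketch only: the two-argument variant delegates with dirs = None,
// i.e. "use the DiskBlockManager's own local directories".
def getDataFile(shuffleId: Int, mapId: Long): File =
  getDataFile(shuffleId, mapId, None)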
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#creating-managedbuffer-to-read-shuffle-block-data-file","title":"Creating ManagedBuffer to Read Shuffle Block Data File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getBlockData(\n  blockId: BlockId,\n  dirs: Option[Array[String]]): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getBlockData\u00a0is part of the ShuffleBlockResolver abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getBlockData...FIXME
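Until the description above is filled in, here is the general idea as a sketch (an assumption based on the shuffle index file layout, not a transcript of getBlockData): for a single reduce partition, look up the block's start and end offsets in the index file and wrap the corresponding byte range of the data file in a FileSegmentManagedBuffer. The readBlockData name is hypothetical:

import java.io.{DataInputStream, File, FileInputStream}
import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer}
import org.apache.spark.network.util.TransportConf

// Sketch only: offsets reduceId and reduceId + 1 in the index file bound the reduce block's bytes.
def readBlockData(indexFile: File, dataFile: File, reduceId: Int, transportConf: TransportConf): ManagedBuffer = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skipBytes(reduceId * 8)    // each offset is an 8-byte long (the real code guards against partial skips)
    val startOffset = in.readLong()
    val endOffset = in.readLong()
    new FileSegmentManagedBuffer(transportConf, dataFile, startOffset, endOffset - startOffset)
  } finally {
    in.close()
  }
}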

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#checking-consistency-of-shuffle-index-and-data-files","title":"Checking Consistency of Shuffle Index and Data Files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile(\n  index: File,\n  data: File,\n  blocks: Int): Array[Long]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Danger

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Review Me

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile first checks if the size of the input index file is exactly the input blocks multiplied by 8.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile returns null when the numbers, and hence the shuffle index and data files, don't match.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile reads the shuffle index file and converts the offsets into lengths of each block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile makes sure that the size of the input shuffle data file is exactly the sum of the block lengths.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                checkIndexAndDataFile returns the block lengths if the numbers match, and null otherwise.
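A minimal sketch of the check under the layout described above (blocks + 1 cumulative offsets, the first one 0); the real implementation differs in I/O details and error handling:

import java.io.{DataInputStream, File, FileInputStream}

// Sketch only: return the per-block lengths when index and data files are consistent, null otherwise.
def checkIndexAndDataFile(index: File, data: File, blocks: Int): Array[Long] = {
  // the index file must hold exactly blocks + 1 offsets, 8 bytes each
  if (index.length() != (blocks + 1) * 8L) return null
  val lengths = new Array[Long](blocks)
  val in = new DataInputStream(new FileInputStream(index))
  try {
    var prev = in.readLong()        // leading offset, expected to be 0
    var i = 0
    while (i < blocks) {
      val offset = in.readLong()
      lengths(i) = offset - prev    // length of block i
      prev = offset
      i += 1
    }
  } finally {
    in.close()
  }
  // the data file must be exactly as long as the sum of the block lengths
  if (data.length() == lengths.sum) lengths else null
}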

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#transportconf","title":"TransportConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                IndexShuffleBlockResolver creates a TransportConf (for shuffle module) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                transportConf\u00a0is used in getMigrationBlocks and getBlockData.
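Presumably this is the usual way a shuffle-scoped TransportConf is derived from a SparkConf; a tiny sketch of that assumption (SparkTransportConf is Spark-internal, so this only compiles from within Spark's own packages):

package org.apache.spark.shuffle

import org.apache.spark.SparkConf
import org.apache.spark.network.netty.SparkTransportConf
import org.apache.spark.network.util.TransportConf

object TransportConfSketch {
  // Sketch only: build a TransportConf scoped to the "shuffle" module.
  def shuffleTransportConf(conf: SparkConf): TransportConf =
    SparkTransportConf.fromSparkConf(conf, "shuffle")
}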

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/IndexShuffleBlockResolver/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.shuffle.IndexShuffleBlockResolver logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.shuffle.IndexShuffleBlockResolver=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/LocalDiskShuffleDataIO/","title":"LocalDiskShuffleDataIO","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                LocalDiskShuffleDataIO is a ShuffleDataIO.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/LocalDiskShuffleDataIO/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ShuffleExecutorComponents executor()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                executor\u00a0is part of the ShuffleDataIO abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                executor creates a new LocalDiskShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/LocalDiskShuffleExecutorComponents/","title":"LocalDiskShuffleExecutorComponents","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                LocalDiskShuffleExecutorComponents is a ShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/LocalDiskShuffleExecutorComponents/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                LocalDiskShuffleExecutorComponents takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalDiskShuffleExecutorComponents is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • LocalDiskShuffleDataIO is requested for a ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/LocalDiskShuffleMapOutputWriter/","title":"LocalDiskShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalDiskShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/LocalDiskSingleSpillMapOutputWriter/","title":"LocalDiskSingleSpillMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  LocalDiskSingleSpillMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/MigratableResolver/","title":"MigratableResolver","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MigratableResolver is an abstraction of resolvers that allow Spark to migrate shuffle blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/MigratableResolver/#contract","title":"Contract","text":""},{"location":"shuffle/MigratableResolver/#getmigrationblocks","title":"getMigrationBlocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getMigrationBlocks(\n  shuffleBlockInfo: ShuffleBlockInfo): List[(BlockId, ManagedBuffer)]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleMigrationRunnable is requested to run
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/MigratableResolver/#getstoredshuffles","title":"getStoredShuffles
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getStoredShuffles(): Seq[ShuffleBlockInfo]\n

Used when:

• BlockManagerDecommissioner is requested to refreshOffloadingShuffleBlocks

putShuffleBlockAsStream

putShuffleBlockAsStream(
  blockId: BlockId,
  serializerManager: SerializerManager): StreamCallbackWithID

Used when:

• BlockManager is requested to putBlockDataAsStream
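To make the streaming side of the contract concrete, here is a minimal, hypothetical sketch of a resolver that follows the putShuffleBlockAsStream shape above: it buffers the incoming stream in memory and hands the bytes to a made-up storeBlock helper once the stream completes. The class and helper are illustrative only and are not wired into the real MigratableResolver trait; IndexShuffleBlockResolver actually writes to local index and data files instead.

import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

import org.apache.spark.network.client.StreamCallbackWithID
import org.apache.spark.serializer.SerializerManager
import org.apache.spark.storage.BlockId

// Hypothetical resolver: buffers a streamed shuffle block in memory and
// "stores" it once the stream completes. Illustrative sketch only.
class InMemoryShuffleBlockStore {

  // Same shape as the MigratableResolver contract above; serializerManager
  // could be used to wrap the stream (e.g. for encryption), omitted here.
  def putShuffleBlockAsStream(
      blockId: BlockId,
      serializerManager: SerializerManager): StreamCallbackWithID = {
    val buffer = new ByteArrayOutputStream()
    new StreamCallbackWithID {
      override def getID(): String = blockId.name

      override def onData(streamId: String, buf: ByteBuffer): Unit = {
        // Copy the incoming chunk into the in-memory buffer
        val chunk = new Array[Byte](buf.remaining())
        buf.get(chunk)
        buffer.write(chunk)
      }

      override def onComplete(streamId: String): Unit =
        storeBlock(blockId, buffer.toByteArray)   // made-up helper

      override def onFailure(streamId: String, cause: Throwable): Unit =
        buffer.reset()
    }
  }

  private def storeBlock(blockId: BlockId, bytes: Array[Byte]): Unit = {
    // A real implementation would persist the bytes (e.g. to disk or a remote store)
  }
}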
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/MigratableResolver/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • IndexShuffleBlockResolver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/SerializedShuffleHandle/","title":"SerializedShuffleHandle","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SerializedShuffleHandle is a BaseShuffleHandle that SortShuffleManager uses when canUseSerializedShuffle (when requested to register a shuffle and BypassMergeSortShuffleHandles could not be selected).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SerializedShuffleHandle tells SortShuffleManager to use UnsafeShuffleWriter when requested for a ShuffleWriter.
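As a simplified, non-authoritative sketch of the selection just described, SortShuffleManager's handle choice boils down to the branches below (constructor details, type parameters and casts in the real registerShuffle are elided):

// Simplified sketch of the handle selection; not the exact source.
def selectHandle[K, V, C](
    shuffleId: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    // Few reduce partitions and no map-side combine: write per-partition files directly
    new BypassMergeSortShuffleHandle(shuffleId, dependency)
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // Records can be sorted in their serialized form: use the serialized (unsafe) shuffle
    new SerializedShuffleHandle(shuffleId, dependency)
  } else {
    // Fall back to the generic sort-based shuffle
    new BaseShuffleHandle(shuffleId, dependency)
  }
}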

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/SerializedShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SerializedShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleDependency

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SerializedShuffleHandle is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleManager is requested for a ShuffleHandle (for the ShuffleDependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleBlockPusher/","title":"ShuffleBlockPusher","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockPusher is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleBlockResolver/","title":"ShuffleBlockResolver","text":"

ShuffleBlockResolver is an abstraction of shuffle block resolvers that BlockManager uses to retrieve shuffle block data for a logical shuffle block identifier (i.e. a combination of shuffle, map and reduce IDs).

Note: Shuffle block data files are often referred to as map output files.

Implementations

• IndexShuffleBlockResolver (the default and only known ShuffleBlockResolver in Apache Spark)

Contract

getBlockData

getBlockData(
  blockId: ShuffleBlockId): ManagedBuffer

Retrieves the data (as a ManagedBuffer) for the given block (a ShuffleBlockId, i.e. a tuple of shuffleId, mapId and reduceId).

Used when:

• BlockManager is requested for the bytes of a local block (getLocalBytes) and for block data (getBlockData)

stop

stop(): Unit

Stops the ShuffleBlockResolver.

Used when:

• SortShuffleManager is requested to stop
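For reference, the logical identifier that getBlockData receives is a ShuffleBlockId, a small case class over the three coordinates. A quick illustration (the ID values below are arbitrary):

import org.apache.spark.storage.ShuffleBlockId

// A shuffle block is addressed by (shuffleId, mapId, reduceId).
val blockId = ShuffleBlockId(shuffleId = 0, mapId = 42L, reduceId = 3)

// The block name encodes the same coordinates and is used as the lookup key.
assert(blockId.name == "shuffle_0_42_3")

// A ShuffleBlockResolver would then return the block's bytes as a ManagedBuffer:
// val data: ManagedBuffer = resolver.getBlockData(blockId)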

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleDataIO/","title":"ShuffleDataIO","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDataIO is an abstraction of pluggable temporary shuffle block store plugins for storing shuffle blocks in arbitrary storage backends.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleDataIO/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleDataIO/#shuffledrivercomponents","title":"ShuffleDriverComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleDriverComponents driver()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleDataIO/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleExecutorComponents executor()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleManager utility is used to load the ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleDataIO/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LocalDiskShuffleDataIO
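To give a feel for what a custom plugin looks like, the sketch below implements the two-method contract; the class name is hypothetical and the bodies are stubs. It assumes (mirroring LocalDiskShuffleDataIO) that the plugin exposes a constructor taking a SparkConf, which is worth verifying against your Spark version. Such a class would be registered via spark.shuffle.sort.io.plugin.class (see the next section).

import org.apache.spark.SparkConf
import org.apache.spark.shuffle.api.{ShuffleDataIO, ShuffleDriverComponents, ShuffleExecutorComponents}

// Hypothetical ShuffleDataIO plugin that would back shuffle blocks with an
// external object store. Only the contract above is shown; bodies are stubs.
class ObjectStoreShuffleDataIO(conf: SparkConf) extends ShuffleDataIO {

  // Driver-side lifecycle (e.g. register the application, clean up finished shuffles)
  override def driver(): ShuffleDriverComponents =
    throw new UnsupportedOperationException("sketch only")

  // Executor-side writer factory for map output data
  override def executor(): ShuffleExecutorComponents =
    throw new UnsupportedOperationException("sketch only")
}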
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleDataIOUtils/","title":"ShuffleDataIOUtils","text":""},{"location":"shuffle/ShuffleDataIOUtils/#loading-shuffledataio","title":"Loading ShuffleDataIO
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    loadShuffleDataIO(\n  conf: SparkConf): ShuffleDataIO\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    loadShuffleDataIO uses the spark.shuffle.sort.io.plugin.class configuration property to load the ShuffleDataIO.
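For example, plugging in a custom ShuffleDataIO (such as the hypothetical ObjectStoreShuffleDataIO sketched earlier) comes down to setting this property before the SparkContext is created:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("custom-shuffle-io-demo")
  // Fully-qualified name of a ShuffleDataIO implementation; the class name here is
  // hypothetical. The default is Spark's LocalDiskShuffleDataIO.
  .set("spark.shuffle.sort.io.plugin.class", "com.example.shuffle.ObjectStoreShuffleDataIO")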

loadShuffleDataIO is used when:

• SparkContext is created
• SortShuffleManager utility is used to loadShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleDriverComponents/","title":"ShuffleDriverComponents","text":"

ShuffleDriverComponents is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleExecutorComponents/","title":"ShuffleExecutorComponents","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleExecutorComponents is an abstraction of executor shuffle builders.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleExecutorComponents/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleExecutorComponents/#createmapoutputwriter","title":"createMapOutputWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleMapOutputWriter createMapOutputWriter(\n  int shuffleId,\n  long mapTaskId,\n  int numPartitions) throws IOException\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Creates a ShuffleMapOutputWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BypassMergeSortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeShuffleWriter is requested to mergeSpills and mergeSpillsUsingStandardWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleWriter is requested to write records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#createsinglefilemapoutputwriter","title":"createSingleFileMapOutputWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Optional<SingleSpillShuffleMapOutputWriter> createSingleFileMapOutputWriter(\n  int shuffleId,\n  long mapId) throws IOException\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Creates a SingleSpillShuffleMapOutputWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Default: empty

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeShuffleWriter is requested to mergeSpills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#initializeexecutor","title":"initializeExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    void initializeExecutor(\n  String appId,\n  String execId,\n  Map<String, String> extraConfigs);\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SortShuffleManager utility is used to loadShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleExecutorComponents/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LocalDiskShuffleExecutorComponents
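Putting the three methods together, a bare-bones implementation might look like the sketch below. The class name is made up and the method signatures simply mirror the contract listed above; the assumption that createSingleFileMapOutputWriter can be left at its default (an empty Optional, signalling that the single-spill fast path is unavailable) follows the "Default: empty" note and should be checked against your Spark version.

import java.util.{Map => JMap}

import org.apache.spark.shuffle.api.{ShuffleExecutorComponents, ShuffleMapOutputWriter}

// Hypothetical executor-side components; signatures mirror the contract above,
// bodies are placeholders only.
class NoopShuffleExecutorComponents extends ShuffleExecutorComponents {

  override def initializeExecutor(
      appId: String,
      execId: String,
      extraConfigs: JMap[String, String]): Unit = {
    // e.g. open connections to an external shuffle store here
  }

  override def createMapOutputWriter(
      shuffleId: Int,
      mapTaskId: Long,
      numPartitions: Int): ShuffleMapOutputWriter = {
    // Return a writer that accepts one output stream per reduce partition
    throw new UnsupportedOperationException("sketch only")
  }

  // createSingleFileMapOutputWriter keeps its default implementation
  // (Optional.empty), so UnsafeShuffleWriter falls back to the generic path.
}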
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"shuffle/ShuffleExternalSorter/","title":"ShuffleExternalSorter","text":"

ShuffleExternalSorter is a specialized cache-efficient sorter that sorts arrays of compressed record pointers and partition ids.

ShuffleExternalSorter uses only 8 bytes of space per record in the sorting array to fit more of the array into cache.
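
The following is a minimal, self-contained sketch of why 8 bytes per record are enough. The bit layout here is an assumption for illustration only (it is not Spark's exact encoding): a partition id is packed into the high bits of a `long` next to a record address, so sorting the plain `long[]` array also groups records by partition.

```java
import java.util.Arrays;

// Sketch only: hypothetical layout with the high 24 bits holding the partition id
// and the low 40 bits holding a record address.
final class PackedPointerSketch {
  static long pack(int partitionId, long recordAddress) {
    return (((long) partitionId) << 40) | (recordAddress & ((1L << 40) - 1));
  }

  static int partitionId(long packed) {
    return (int) (packed >>> 40);
  }

  static long recordAddress(long packed) {
    return packed & ((1L << 40) - 1);
  }

  public static void main(String[] args) {
    long[] pointers = {
        pack(7, 1024L), pack(0, 2048L), pack(7, 512L), pack(3, 4096L)
    };
    // Sorting the packed longs orders records by partition id (the high bits).
    Arrays.sort(pointers);
    for (long p : pointers) {
      System.out.println("partition=" + partitionId(p) + " address=" + recordAddress(p));
    }
  }
}
```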

ShuffleExternalSorter is created and used by UnsafeShuffleWriter only.

## MemoryConsumer

ShuffleExternalSorter is a MemoryConsumer with a page size of 128 MB (or smaller, if the TaskMemoryManager uses a smaller page size).
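
As a small illustration of the page-size rule above, the effective page size is the smaller of the 128 MB cap and the TaskMemoryManager's own page size. The TaskMemoryManager value below is an assumed figure for the example, not a default.

```java
public class PageSizeSketch {
  public static void main(String[] args) {
    long maxPageSizeBytes = 128L * 1024 * 1024;           // 128 MB cap
    long taskMemoryManagerPageSize = 64L * 1024 * 1024;   // assumed value for this example
    long pageSizeBytes = Math.min(maxPageSizeBytes, taskMemoryManagerPageSize);
    System.out.println(pageSizeBytes);                    // 67108864, i.e. 64 MB
  }
}
```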

ShuffleExternalSorter can spill to disk to free up execution memory.

## Configuration Properties

### spark.shuffle.file.buffer

ShuffleExternalSorter uses the spark.shuffle.file.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleExternalSorter/#sparkshufflespillnumelementsforcespillthreshold","title":"spark.shuffle.spill.numElementsForceSpillThreshold

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleExternalSorter uses spark.shuffle.spill.numElementsForceSpillThreshold configuration property for...FIXME
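
For illustration, both properties can be set on a SparkConf before the SparkContext is created. The values below are arbitrary examples, not recommendations.

```java
import org.apache.spark.SparkConf;

public class ShuffleSorterConfSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("shuffle-sorter-conf-sketch")
        .setMaster("local[*]")
        // In-memory buffer size used when writing shuffle (spill) files.
        .set("spark.shuffle.file.buffer", "64k")
        // Number of in-memory records that forces a spill to disk.
        .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000");

    System.out.println(conf.toDebugString());
  }
}
```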

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"shuffle/ShuffleExternalSorter/#creating-instance","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleExternalSorter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskMemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Initial Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Number of Partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleWriteMetricsReporter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleExternalSorter is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • UnsafeShuffleWriter is requested to open a ShuffleExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#shuffleinmemorysorter","title":"ShuffleInMemorySorter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleExternalSorter manages a ShuffleInMemorySorter:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleInMemorySorter is created immediately when ShuffleExternalSorter is

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleInMemorySorter is requested to free up memory and dereferenced (nulled) when ShuffleExternalSorter is requested to cleanupResources and closeAndGetSpills

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleExternalSorter uses the ShuffleInMemorySorter for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • spill
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • getMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • growPointerArrayIfNecessary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • insertRecord
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#spilling-to-disk","title":"Spilling To Disk
    long spill(
      long size,
      MemoryConsumer trigger)

spill is part of the MemoryConsumer abstraction.

spill returns the memory bytes spilled (spill size).

spill prints out the following INFO message to the logs:

    Thread [threadId] spilling sort data of [memoryUsage] to disk ([spillsSize] [time|times] so far)

spill writes a sorted file (writeSortedFile) with the isLastFile flag disabled.

spill frees up execution memory (and records the memory bytes spilled as spillSize).

spill requests the ShuffleInMemorySorter to reset.

In the end, spill requests the TaskContext for TaskMetrics to increase the memory bytes spilled.
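
The following self-contained sketch (hypothetical class and method names, not Spark's implementation) mirrors the spill flow described above: log the spill, write the sorted in-memory pointers out to a file, release the memory, and reset the in-memory buffer. The caller would record the returned spill size in the task metrics.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SpillSketch {
  private final List<Long> packedPointers = new ArrayList<>();
  private final List<Path> spillFiles = new ArrayList<>();

  void insert(long packedPointer) {
    packedPointers.add(packedPointer);
  }

  long spill() throws IOException {
    if (packedPointers.isEmpty()) return 0L;
    long spillSize = packedPointers.size() * 8L;  // 8 bytes per record pointer
    System.out.printf("Thread %d spilling sort data of %d B to disk (%d times so far)%n",
        Thread.currentThread().getId(), spillSize, spillFiles.size());
    Collections.sort(packedPointers);             // sort by partition id (in the high bits)
    Path file = Files.createTempFile("spill-", ".bin");
    try (DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(Files.newOutputStream(file)))) {
      for (long p : packedPointers) out.writeLong(p);
    }
    spillFiles.add(file);
    packedPointers.clear();                       // "reset" the in-memory sorter
    return spillSize;                             // caller records this as memory bytes spilled
  }

  public static void main(String[] args) throws IOException {
    SpillSketch sorter = new SpillSketch();
    for (long i = 0; i < 10; i++) sorter.insert(i);
    System.out.println("spilled " + sorter.spill() + " bytes");
  }
}
```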

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#closeandgetspills","title":"closeAndGetSpills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      SpillInfo[] closeAndGetSpills()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      closeAndGetSpills...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      closeAndGetSpills is used when UnsafeShuffleWriter is requested to closeAndWriteOutput.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#getmemoryusage","title":"getMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      long getMemoryUsage()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMemoryUsage...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getMemoryUsage is used when ShuffleExternalSorter is created and requested to spill and updatePeakMemoryUsed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#updatepeakmemoryused","title":"updatePeakMemoryUsed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void updatePeakMemoryUsed()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      updatePeakMemoryUsed...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      updatePeakMemoryUsed is used when ShuffleExternalSorter is requested to getPeakMemoryUsedBytes and freeMemory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#writesortedfile","title":"writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void writeSortedFile(\n  boolean isLastFile)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      writeSortedFile...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      writeSortedFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleExternalSorter is requested to spill and closeAndGetSpills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#cleanupresources","title":"cleanupResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void cleanupResources()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cleanupResources...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      cleanupResources is used when UnsafeShuffleWriter is requested to write records and stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#inserting-serialized-record-into-shuffleinmemorysorter","title":"Inserting Serialized Record Into ShuffleInMemorySorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void insertRecord(\n  Object recordBase,\n  long recordOffset,\n  int length,\n  int partitionId)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord growPointerArrayIfNecessary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord acquireNewPageIfNecessary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      insertRecord is used when UnsafeShuffleWriter is requested to insertRecordIntoSorter
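
A tiny sketch (hypothetical, not Spark's code) of the grow-then-append pattern that insertRecord follows for the in-memory pointer array. In Spark the new array is allocated through the memory manager and the allocation itself may trigger a spill; this sketch simply doubles a plain Java array.

```java
import java.util.Arrays;

public class PointerArraySketch {
  private long[] pointers = new long[4];
  private int numRecords = 0;

  void growPointerArrayIfNecessary() {
    if (numRecords == pointers.length) {
      // Double the capacity when the array is full.
      pointers = Arrays.copyOf(pointers, pointers.length * 2);
    }
  }

  void insertRecord(long packedPointer) {
    growPointerArrayIfNecessary();
    pointers[numRecords++] = packedPointer;
  }

  public static void main(String[] args) {
    PointerArraySketch sorter = new PointerArraySketch();
    for (long i = 0; i < 100; i++) sorter.insertRecord(i);
    System.out.println("records: " + sorter.numRecords);
  }
}
```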

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#growpointerarrayifnecessary","title":"growPointerArrayIfNecessary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void growPointerArrayIfNecessary()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      growPointerArrayIfNecessary...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#acquirenewpageifnecessary","title":"acquireNewPageIfNecessary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      void acquireNewPageIfNecessary(\n  int required)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      acquireNewPageIfNecessary...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#freememory","title":"freeMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      long freeMemory()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      freeMemory...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      freeMemory is used when ShuffleExternalSorter is requested to spill, cleanupResources, and closeAndGetSpills.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#peak-memory-used","title":"Peak Memory Used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      long getPeakMemoryUsedBytes()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPeakMemoryUsedBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getPeakMemoryUsedBytes is used when UnsafeShuffleWriter is requested to updatePeakMemoryUsed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleExternalSorter/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Enable ALL logging level for org.apache.spark.shuffle.sort.ShuffleExternalSorter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      log4j.logger.org.apache.spark.shuffle.sort.ShuffleExternalSorter=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"shuffle/ShuffleHandle/","title":"ShuffleHandle","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleHandle is an abstraction of shuffle handles for ShuffleManager to pass information about shuffles to tasks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleHandle is Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/ShuffleHandle/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • BaseShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"shuffle/ShuffleHandle/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleHandle takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Shuffle ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Abstract Class

ShuffleHandle is an abstract class and cannot be created directly. Only the concrete ShuffleHandles (e.g. BaseShuffleHandle) are instantiated.
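For illustration only, a minimal Scala sketch of a concrete handle follows. It mirrors what BaseShuffleHandle does (carrying the ShuffleDependency alongside the shuffle ID); the class name MyShuffleHandle is hypothetical, and since ShuffleHandle is private[spark], such a class could only live inside Spark's own packages.

import org.apache.spark.ShuffleDependency
import org.apache.spark.shuffle.ShuffleHandle

// Hypothetical concrete handle (illustrative name only): it carries the
// ShuffleDependency next to the shuffle ID so tasks can later look up
// everything they need about the shuffle, much like BaseShuffleHandle.
class MyShuffleHandle[K, V, C](
    shuffleId: Int,
    val dependency: ShuffleDependency[K, V, C])
  extends ShuffleHandle(shuffleId)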

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"shuffle/ShuffleInMemorySorter/","title":"ShuffleInMemorySorter","text":"

ShuffleInMemorySorter is used by ShuffleExternalSorter to sort record pointers in memory using the radix or tim sort algorithms.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[creating-instance]] Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ShuffleInMemorySorter takes the following to be created:

• MemoryConsumer
• Initial size
• useRadixSort flag (to indicate whether to use radix sort or tim sort)

ShuffleInMemorySorter requests the given MemoryConsumer to allocate an array of the given initial size for the record pointers.

ShuffleInMemorySorter is created for a ShuffleExternalSorter.
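As a rough sketch of the allocation step above (an assumed simplification of the Java constructor, not the actual source), the sorter obtains its backing LongArray from the owning MemoryConsumer so the memory stays tracked by Spark's memory manager:

import org.apache.spark.memory.MemoryConsumer
import org.apache.spark.unsafe.array.LongArray

// Sketch: allocate the pointer array up front through the MemoryConsumer.
def allocatePointerArray(consumer: MemoryConsumer, initialSize: Int): LongArray = {
  require(initialSize > 0, "initialSize must be positive")
  consumer.allocateArray(initialSize) // a LongArray of initialSize elements
}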

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[getSortedIterator]] Iterator of Records Sorted

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleInMemorySorter/#source-java","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#shufflesorteriterator-getsortediterator","title":"ShuffleSorterIterator getSortedIterator()","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getSortedIterator...FIXME

getSortedIterator is used when ShuffleExternalSorter is requested to writeSortedFile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[reset]] Resetting

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_1","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#void-reset","title":"void reset()","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          reset...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          reset is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[numRecords]] numRecords Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_2","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#int-numrecords","title":"int numRecords()","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          numRecords...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          numRecords is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[getUsableCapacity]] Calculating Usable Capacity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleInMemorySorter/#source-java_3","title":"[source, java]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#int-getusablecapacity","title":"int getUsableCapacity()","text":"

getUsableCapacity calculates the usable capacity as half (radix sort) or two-thirds (tim sort) of the size of the underlying LongArray.
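Expressed as a small Scala sketch (assuming the divisors 2 and 1.5: radix sort needs as much scratch space as the data itself, tim sort only half):

// Sketch of the usable-capacity formula: half of the backing array for
// radix sort, two-thirds of it for tim sort.
def usableCapacity(arraySize: Long, useRadixSort: Boolean): Int =
  (arraySize / (if (useRadixSort) 2 else 1.5)).toInt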

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getUsableCapacity is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable ALL logging level for org.apache.spark.shuffle.sort.ShuffleExternalSorter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleInMemorySorter/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"shuffle/ShuffleInMemorySorter/#log4jloggerorgapachesparkshufflesortshuffleexternalsorterall","title":"log4j.logger.org.apache.spark.shuffle.sort.ShuffleExternalSorter=ALL","text":"

Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          == [[internal-properties]] Internal Properties

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[array]] Unsafe LongArray of Record Pointers and Partition IDs

ShuffleInMemorySorter uses an unsafe LongArray to store the record pointers and partition IDs to sort (one packed 64-bit entry per record).
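A sketch of the packed 64-bit layout (assumed from PackedRecordPointer: the partition ID in the upper 24 bits, the compressed record pointer in the lower 40 bits):

// Sketch only: pack and unpack a sort-array entry.
def pack(partitionId: Int, compressedPointer: Long): Long =
  (partitionId.toLong << 40) | (compressedPointer & ((1L << 40) - 1))

def partitionIdOf(packed: Long): Int =
  (packed >>> 40).toInt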

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          === [[usableCapacity]] Usable Capacity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleInMemorySorter...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleManager/","title":"ShuffleManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleManager is an abstraction of shuffle managers that manage shuffle data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleManager is specified using spark.shuffle.manager configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleManager is used to create a BlockManager.
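For example, the implementation can be selected through that configuration property; the default "sort" resolves to the sort-based shuffle manager, while a fully-qualified class name (hypothetical below) selects a custom one:

import org.apache.spark.SparkConf

// Sketch: choosing the ShuffleManager for an application.
// "com.example.MyShuffleManager" is a made-up class name.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "com.example.MyShuffleManager")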

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleManager/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleManager/#getting-shufflereader-for-shufflehandle","title":"Getting ShuffleReader for ShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getReader[K, C](\n  handle: ShuffleHandle,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleReader to read shuffle data (for the given ShuffleHandle)

Used when the following RDDs are requested to compute a partition (see the sketch after this list):

• CoGroupedRDD
• ShuffledRDD
• ShuffledRowRDD (Spark SQL)
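A sketch of how a caller uses this contract, loosely modelled on how ShuffledRDD computes a partition (not a verbatim copy of Spark's source; it would only compile inside Spark's own packages since the shuffle APIs are private[spark]):

import org.apache.spark.{ShuffleDependency, SparkEnv, TaskContext}

// Sketch: read one reduce partition's shuffle data through the ShuffleManager.
def readPartition[K, V, C](
    dep: ShuffleDependency[K, V, C],
    partitionIndex: Int,
    context: TaskContext): Iterator[(K, C)] = {
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  SparkEnv.get.shuffleManager
    .getReader(dep.shuffleHandle, partitionIndex, partitionIndex + 1, context, metrics)
    .read()
    .asInstanceOf[Iterator[(K, C)]]
}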
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#getreaderforrange","title":"getReaderForRange
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getReaderForRange[K, C](\n  handle: ShuffleHandle,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleReader for a range of reduce partitions to read from map output in the ShuffleHandle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when ShuffledRowRDD (Spark SQL) is requested to compute a partition

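As an illustration, the following helper mirrors roughly what ShuffledRowRDD does for a single reduce partition: it asks the ShuffleManager (from SparkEnv) for a reader over a sub-range of map outputs and returns the combined records. This is only a sketch: the helper name readRange and the use of createTempShuffleReadMetrics as the metrics reporter are assumptions, and the ShuffleManager API is internal (private[spark]).

import org.apache.spark.{ShuffleDependency, SparkEnv, TaskContext}

// Hypothetical helper: read one reduce partition from the map outputs
// in the index range [startMapIndex, endMapIndex) using getReaderForRange.
def readRange[K, C](
    dep: ShuffleDependency[K, _, C],
    startMapIndex: Int,
    endMapIndex: Int,
    reducePartition: Int,
    context: TaskContext): Iterator[Product2[K, C]] = {
  val shuffleManager = SparkEnv.get.shuffleManager
  // Per-task shuffle-read metrics reporter (assumed to come from TaskMetrics)
  val metrics = context.taskMetrics().createTempShuffleReadMetrics()
  shuffleManager
    .getReaderForRange[K, C](
      dep.shuffleHandle,
      startMapIndex, endMapIndex,
      reducePartition, reducePartition + 1, // a single reduce partition
      context, metrics)
    .read()
}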
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#getting-shufflewriter-for-shufflehandle","title":"Getting ShuffleWriter for ShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          getWriter[K, V](\n  handle: ShuffleHandle,\n  mapId: Long,\n  context: TaskContext,\n  metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriter to write shuffle data in the ShuffleHandle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when ShuffleWriteProcessor is requested to write a partition

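For illustration, a map task obtains its writer roughly as below (a sketch only; it assumes the task's ShuffleWriteMetrics from TaskMetrics serves as the ShuffleWriteMetricsReporter, and the types involved are private[spark]).

import org.apache.spark.{ShuffleDependency, SparkEnv, TaskContext}

// Sketch: obtain a ShuffleWriter for the map task identified by mapId.
def writerFor[K, V](
    dep: ShuffleDependency[K, V, _],
    mapId: Long,
    context: TaskContext) =
  SparkEnv.get.shuffleManager.getWriter[K, V](
    dep.shuffleHandle,
    mapId,
    context,
    context.taskMetrics().shuffleWriteMetrics) // reporter of the task's shuffle-write metrics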
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#registering-shuffle-of-shuffledependency-and-getting-shufflehandle","title":"Registering Shuffle of ShuffleDependency (and Getting ShuffleHandle)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          registerShuffle[K, V, C](\n  shuffleId: Int,\n  dependency: ShuffleDependency[K, V, C]): ShuffleHandle\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Registers a shuffle (by the given shuffleId and ShuffleDependency) and gives a ShuffleHandle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when ShuffleDependency is created (and registers with the shuffle system)

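A rough sketch of that registration step follows; the shuffleId allocation and the surrounding ShuffleDependency constructor are paraphrased, not quoted from the source.

import org.apache.spark.{ShuffleDependency, SparkEnv}
import org.apache.spark.shuffle.ShuffleHandle

// Sketch of what a ShuffleDependency does when it is created: register itself
// with the shuffle system and keep the handle for later writes and reads.
def registerWithShuffleSystem[K, V, C](
    shuffleId: Int, // e.g. an id allocated by the SparkContext
    dep: ShuffleDependency[K, V, C]): ShuffleHandle =
  SparkEnv.get.shuffleManager.registerShuffle(shuffleId, dep)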
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#shuffleblockresolver","title":"ShuffleBlockResolver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          shuffleBlockResolver: ShuffleBlockResolver\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleBlockResolver of the shuffle system

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SortShuffleManager is requested for a ShuffleWriter for a ShuffleHandle, to unregister a shuffle and stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockManager is requested to getLocalBlockData and getHostLocalShuffleData
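For example, serving a shuffle block locally boils down to asking the resolver for the block's data buffer. The sketch below assumes the resolver's getBlockData method and a ShuffleBlockId built from the requested shuffle, map and reduce ids.

import org.apache.spark.SparkEnv
import org.apache.spark.storage.ShuffleBlockId

// Sketch: roughly what BlockManager.getLocalBlockData does for a shuffle block.
def localShuffleBlockData(shuffleId: Int, mapId: Long, reduceId: Int) = {
  val resolver = SparkEnv.get.shuffleManager.shuffleBlockResolver
  resolver.getBlockData(ShuffleBlockId(shuffleId, mapId, reduceId)) // ManagedBuffer
}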
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#stopping-shufflemanager","title":"Stopping ShuffleManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Stops the shuffle system

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when SparkEnv is requested to stop

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#unregistering-shuffle","title":"Unregistering Shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          unregisterShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Unregisters a given shuffle

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when BlockManagerSlaveEndpoint is requested to handle a RemoveShuffle message

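In other words, removing a shuffle on an executor amounts to a single call into the shuffle system (a sketch; the surrounding message handling is omitted).

import org.apache.spark.SparkEnv

// Sketch: what handling a RemoveShuffle message amounts to on an executor.
// The Boolean indicates whether the shuffle was removed from the shuffle system.
def removeShuffle(shuffleId: Int): Boolean =
  SparkEnv.get.shuffleManager.unregisterShuffle(shuffleId)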
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SortShuffleManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleManager/#accessing-shufflemanager-using-sparkenv","title":"Accessing ShuffleManager using SparkEnv

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleManager is available on the driver and executors using SparkEnv.shuffleManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          val shuffleManager = SparkEnv.get.shuffleManager\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleMapOutputWriter/","title":"ShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleReader/","title":"ShuffleReader","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleReader is an abstraction of shuffle block readers that can read combined key-value records for a reduce task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleReader/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleReader/#reading-combined-records-for-reduce-task","title":"Reading Combined Records (for Reduce Task)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          read(): Iterator[Product2[K, C]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • CoGroupedRDD, ShuffledRDD are requested to compute a partition (for a ShuffleDependency dependency)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffledRowRDD (Spark SQL) is requested to compute a partition
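Put together, the reduce-side read path of an RDD's compute looks roughly like this. It is a sketch only: it assumes the getReader variant of ShuffleManager that takes start and end reduce partitions, and uses createTempShuffleReadMetrics from TaskMetrics as the metrics reporter.

import org.apache.spark.{Partition, ShuffleDependency, SparkEnv, TaskContext}

// Sketch: compute a reduce partition by reading its combined records
// through the ShuffleReader returned by the ShuffleManager.
def computeReducePartition[K, C](
    dep: ShuffleDependency[K, _, C],
    split: Partition,
    context: TaskContext): Iterator[Product2[K, C]] =
  SparkEnv.get.shuffleManager
    .getReader[K, C](
      dep.shuffleHandle,
      split.index,     // start reduce partition (inclusive)
      split.index + 1, // end reduce partition (exclusive)
      context,
      context.taskMetrics().createTempShuffleReadMetrics())
    .read()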
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleReader/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockStoreShuffleReader
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriteMetricsReporter/","title":"ShuffleWriteMetricsReporter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriteMetricsReporter is an abstraction of shuffle write metrics reporters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriteMetricsReporter/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#decbyteswritten","title":"decBytesWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          decBytesWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#decrecordswritten","title":"decRecordsWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          decRecordsWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#incbyteswritten","title":"incBytesWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          incBytesWritten(\n  v: Long): Unit\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#increcordswritten","title":"incRecordsWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          incRecordsWritten(\n  v: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          See ShuffleWriteMetrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShufflePartitionPairsWriter is requested to recordWritten
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DiskBlockObjectWriter is requested to record bytes written
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#incwritetime","title":"incWriteTime
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          incWriteTime(\n  v: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BypassMergeSortShuffleWriter is requested to write partition records and writePartitionedData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeShuffleWriter is requested to mergeSpillsWithTransferTo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DiskBlockObjectWriter is requested to commitAndGet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • TimeTrackingOutputStream is requested to write, flush, and close
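To make the contract concrete, here is a minimal reporter that simply accumulates the reported values in counters. It is only a sketch: the trait and its methods are internal (private[spark]), so a real implementation such as ShuffleWriteMetrics lives inside Spark's own packages, and the class name below is made up.

package org.apache.spark.shuffle

import java.util.concurrent.atomic.AtomicLong

// Minimal sketch of a ShuffleWriteMetricsReporter: accumulate what writers report.
private[spark] class CountingWriteMetricsReporter extends ShuffleWriteMetricsReporter {
  private val bytes     = new AtomicLong(0L)
  private val records   = new AtomicLong(0L)
  private val writeTime = new AtomicLong(0L) // write time is reported in nanoseconds

  override private[spark] def incBytesWritten(v: Long): Unit   = bytes.addAndGet(v)
  override private[spark] def decBytesWritten(v: Long): Unit   = bytes.addAndGet(-v)
  override private[spark] def incRecordsWritten(v: Long): Unit = records.addAndGet(v)
  override private[spark] def decRecordsWritten(v: Long): Unit = records.addAndGet(-v)
  override private[spark] def incWriteTime(v: Long): Unit      = writeTime.addAndGet(v)
}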
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteMetricsReporter/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleWriteMetrics
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SQLShuffleWriteMetricsReporter (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriteProcessor/","title":"ShuffleWriteProcessor","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriteProcessor controls write behavior in ShuffleMapTasks while writing partition records out to the shuffle system.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriteProcessor is used to create a ShuffleDependency.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriteProcessor/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriteProcessor takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriteProcessor is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleDependency is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleExchangeExec (Spark SQL) physical operator is requested to createShuffleWriteProcessor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriteProcessor/#writing-partition-records-to-shuffle-system","title":"Writing Partition Records to Shuffle System
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          write(\n  rdd: RDD[_],\n  dep: ShuffleDependency[_, _, _],\n  mapId: Long,\n  context: TaskContext,\n  partition: Partition): MapStatus\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          write requests the ShuffleManager for the ShuffleWriter for the ShuffleHandle (of the given ShuffleDependency).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          write requests the ShuffleWriter to write out records (of the given Partition and RDD).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, write requests the ShuffleWriter to stop (with the success flag enabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In case of any Exceptions, write requests the ShuffleWriter to stop (with the success flag disabled).

write is used when ShuffleMapTask is requested to run.
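
The control flow above can be pictured with a short, self-contained Scala sketch; the Writer trait and the String result below are simplified stand-ins for Spark's ShuffleWriter and MapStatus (not the actual sources), and only the success/failure handling is meant to mirror the description:

object WriteFlowSketch {
  trait Writer {
    def write(records: Iterator[(Any, Any)]): Unit
    def stop(success: Boolean): Option[String]
  }

  def write(writer: Writer, records: Iterator[(Any, Any)]): Option[String] =
    try {
      // write out the partition records to the shuffle system
      writer.write(records)
      // stop with the success flag enabled and return the MapStatus stand-in
      writer.stop(success = true)
    } catch {
      case e: Exception =>
        // on any exception, stop with the success flag disabled and re-throw
        try writer.stop(success = false) catch { case _: Exception => () }
        throw e
    }
}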

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriteProcessor/#creating-metricsreporter","title":"Creating MetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createMetricsReporter(\n  context: TaskContext): ShuffleWriteMetricsReporter\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createMetricsReporter creates a ShuffleWriteMetricsReporter from the given TaskContext.

createMetricsReporter requests the given TaskContext for the TaskMetrics, and the TaskMetrics for the ShuffleWriteMetrics.
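
As a usage illustration (assuming a TaskContext is in scope; the accessors below are part of the public TaskContext and TaskMetrics API), the lookup boils down to:

import org.apache.spark.TaskContext

// the ShuffleWriteMetrics of the current task is what gets used
// as the write-metrics reporter for the shuffle writer
def reporterOf(context: TaskContext) = context.taskMetrics().shuffleWriteMetrics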

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriter/","title":"ShuffleWriter","text":"

ShuffleWriter[K, V] (of K keys and V values) is an abstraction of shuffle writers that can write out key-value records (of an RDD partition) to a shuffle system.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ShuffleWriter is used when ShuffleMapTask is requested to run (and uses a ShuffleWriteProcessor to write partition records to a shuffle system).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/ShuffleWriter/#contract","title":"Contract","text":""},{"location":"shuffle/ShuffleWriter/#writing-out-partition-records-to-shuffle-system","title":"Writing Out Partition Records to Shuffle System
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          write(\n  records: Iterator[Product2[K, V]]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Writes key-value records (of a partition) out to a shuffle system

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleWriteProcessor is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriter/#stopping-shufflewriter","title":"Stopping ShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          stop(\n  success: Boolean): Option[MapStatus]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Stops (closes) the ShuffleWriter and returns a MapStatus if the writing completed successfully. The success flag is the status of the task execution.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ShuffleWriteProcessor is requested to write
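
Putting the write and stop methods together, the contract can be sketched as the following simplified Scala trait (AnyRef stands in for MapStatus; this is an illustration, not the actual source):

trait ShuffleWriterSketch[K, V] {
  // writes the key-value records of a single partition out to the shuffle system
  def write(records: Iterator[Product2[K, V]]): Unit

  // closes the writer; returns Some(mapStatus) only when the write succeeded
  // (success reflects the status of the task execution), None otherwise
  def stop(success: Boolean): Option[AnyRef]
}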
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"shuffle/ShuffleWriter/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BypassMergeSortShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SortShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • UnsafeShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/SingleSpillShuffleMapOutputWriter/","title":"SingleSpillShuffleMapOutputWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SingleSpillShuffleMapOutputWriter is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/SortShuffleManager/","title":"SortShuffleManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SortShuffleManager is the default and only ShuffleManager in Apache Spark (with the short name sort or tungsten-sort).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"shuffle/SortShuffleManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SortShuffleManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleManager is created when SparkEnv is created (on the driver and executors at the very beginning of a Spark application's lifecycle).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/SortShuffleManager/#taskidmapsforshuffle-registry","title":"taskIdMapsForShuffle Registry
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            taskIdMapsForShuffle: ConcurrentHashMap[Int, OpenHashSet[Long]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleManager uses taskIdMapsForShuffle internal registry to track task (attempt) IDs by shuffle.

A new shuffle ID and the writing task (attempt) IDs are added when SortShuffleManager is requested for a ShuffleWriter (for a partition and a ShuffleHandle).

A shuffle ID (with its associated task IDs) is removed when SortShuffleManager is requested to unregister a shuffle.
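
A simplified sketch of how such a registry is maintained (a mutable Set stands in for Spark's OpenHashSet; the method names are illustrative, not the real code):

import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

object TaskIdRegistrySketch {
  val taskIdMapsForShuffle = new ConcurrentHashMap[Int, mutable.Set[Long]]()

  // on getWriter: remember the task (attempt) ID writing for this shuffle
  def recordWriter(shuffleId: Int, taskAttemptId: Long): Unit =
    taskIdMapsForShuffle
      .computeIfAbsent(shuffleId, _ => mutable.Set.empty[Long])
      .add(taskAttemptId)

  // on unregisterShuffle: drop the shuffle with all its task (attempt) IDs
  def unregister(shuffleId: Int): Option[mutable.Set[Long]] =
    Option(taskIdMapsForShuffle.remove(shuffleId))
}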

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#getting-shufflewriter-for-partition-and-shufflehandle","title":"Getting ShuffleWriter for Partition and ShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getWriter[K, V](\n  handle: ShuffleHandle,\n  mapId: Int,\n  context: TaskContext): ShuffleWriter[K, V]\n

getWriter registers the task (attempt) ID of the given TaskContext under the shuffleId of the given ShuffleHandle in the taskIdMapsForShuffle internal registry (unless already registered).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

getWriter expects the input ShuffleHandle to be a BaseShuffleHandle. Moreover, in two (out of three) cases getWriter expects a more specialized subtype, i.e. SerializedShuffleHandle or BypassMergeSortShuffleHandle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getWriter then creates a new ShuffleWriter based on the type of the given ShuffleHandle.

ShuffleHandle → ShuffleWriter:

• SerializedShuffleHandle → UnsafeShuffleWriter
• BypassMergeSortShuffleHandle → BypassMergeSortShuffleWriter
• BaseShuffleHandle → SortShuffleWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getWriter is part of the ShuffleManager abstraction.
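
Since both specialized handles extend BaseShuffleHandle, the dispatch has to test the most specific types first. A self-contained sketch of the mapping in the table above (hypothetical case objects stand in for the real handle and writer classes):

sealed trait HandleKind
case object SerializedHandle extends HandleKind
case object BypassMergeSortHandle extends HandleKind
case object BaseHandle extends HandleKind

def writerFor(handle: HandleKind): String = handle match {
  case SerializedHandle      => "UnsafeShuffleWriter"
  case BypassMergeSortHandle => "BypassMergeSortShuffleWriter"
  case BaseHandle            => "SortShuffleWriter"
}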

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#shuffleexecutorcomponents","title":"ShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            shuffleExecutorComponents: ShuffleExecutorComponents\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleManager defines the shuffleExecutorComponents internal registry for a ShuffleExecutorComponents.

shuffleExecutorComponents is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SortShuffleManager is requested for the ShuffleWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#loadshuffleexecutorcomponents","title":"loadShuffleExecutorComponents
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            loadShuffleExecutorComponents(\n  conf: SparkConf): ShuffleExecutorComponents\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            loadShuffleExecutorComponents loads the ShuffleDataIO that is then requested for the ShuffleExecutorComponents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            loadShuffleExecutorComponents requests the ShuffleExecutorComponents to initialize before returning it.
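
A minimal sketch of that load-then-initialize sequence, assuming the ShuffleDataIO has already been resolved from the configuration (the real method also passes the executor identity and any extra shuffle-plugin configuration to initializeExecutor; the executor ID below is a placeholder):

import java.util.Collections
import org.apache.spark.SparkConf
import org.apache.spark.shuffle.api.{ShuffleDataIO, ShuffleExecutorComponents}

def loadExecutorComponents(conf: SparkConf, dataIO: ShuffleDataIO): ShuffleExecutorComponents = {
  // ask the ShuffleDataIO for its executor-side components
  val components = dataIO.executor()
  // initialize the components before handing them out
  components.initializeExecutor(conf.getAppId, "executor-id-placeholder", Collections.emptyMap[String, String]())
  components
}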

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#creating-shufflehandle-for-shuffledependency","title":"Creating ShuffleHandle for ShuffleDependency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            registerShuffle[K, V, C](\n  shuffleId: Int,\n  dependency: ShuffleDependency[K, V, C]): ShuffleHandle\n

registerShuffle is part of the ShuffleManager abstraction.

registerShuffle creates a new ShuffleHandle (for the given ShuffleDependency) that is one of the following (a sketch of the selection order follows the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. BypassMergeSortShuffleHandle (with ShuffleDependency[K, V, V]) when shouldBypassMergeSort condition holds

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. SerializedShuffleHandle (with ShuffleDependency[K, V, V]) when canUseSerializedShuffle condition holds

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            3. BaseShuffleHandle
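
A self-contained sketch of that selection order (the two predicates and the string results are stand-ins for the real checks and handle classes):

def chooseHandle(shouldBypassMergeSort: Boolean, canUseSerializedShuffle: Boolean): String =
  if (shouldBypassMergeSort) "BypassMergeSortShuffleHandle"
  else if (canUseSerializedShuffle) "SerializedShuffleHandle"
  else "BaseShuffleHandle"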

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#serializedshufflehandle-requirements","title":"SerializedShuffleHandle Requirements
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            canUseSerializedShuffle(\n  dependency: ShuffleDependency[_, _, _]): Boolean\n

canUseSerializedShuffle is true when all of the following hold for the given ShuffleDependency (a sketch of the checks follows the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Serializer (of the given ShuffleDependency) supports relocation of serialized objects

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. mapSideCombine flag (of the given ShuffleDependency) is false

3. Number of partitions (of the Partitioner of the given ShuffleDependency) is not greater than the supported maximum (16777216, i.e. 2^24)
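
A self-contained sketch of the three checks, with the inputs assumed to be already extracted from the ShuffleDependency (the 16777216 limit is 2^24):

def canUseSerializedShuffle(
    serializerSupportsRelocation: Boolean,
    mapSideCombine: Boolean,
    numPartitions: Int): Boolean = {
  val maxSupportedPartitions = 1 << 24 // 16777216
  serializerSupportsRelocation && !mapSideCombine && numPartitions <= maxSupportedPartitions
}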

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            With all of the above positive, canUseSerializedShuffle prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Can use serialized shuffle for shuffle [shuffleId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Otherwise, canUseSerializedShuffle is false and prints out one of the following DEBUG messages based on the failed requirement:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Can't use serialized shuffle for shuffle [id] because the serializer, [name], does not support object relocation\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Can't use serialized shuffle for shuffle [id] because we need to do map-side aggregation\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Can't use serialized shuffle for shuffle [id] because it has more than [number] partitions\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            canUseSerializedShuffle\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SortShuffleManager is requested to register a shuffle (and creates a ShuffleHandle)
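
The following is a minimal, self-contained sketch of this decision logic. ShuffleDependencyInfo and SerializedShuffleCheck are hypothetical stand-ins used only for illustration (they are not the real Spark classes), and println replaces the DEBUG logger.

```scala
// Hypothetical stand-in for the ShuffleDependency properties the check inspects.
case class ShuffleDependencyInfo(
    shuffleId: Int,
    serializerSupportsRelocation: Boolean,
    mapSideCombine: Boolean,
    numPartitions: Int)

object SerializedShuffleCheck {
  // Maximum number of shuffle output partitions supported in serialized mode.
  val MaxShuffleOutputPartitionsForSerializedMode: Int = 1 << 24

  def canUseSerializedShuffle(dep: ShuffleDependencyInfo): Boolean =
    if (!dep.serializerSupportsRelocation) {
      println(s"Can't use serialized shuffle for shuffle ${dep.shuffleId} " +
        "because the serializer does not support object relocation")
      false
    } else if (dep.mapSideCombine) {
      println(s"Can't use serialized shuffle for shuffle ${dep.shuffleId} " +
        "because we need to do map-side aggregation")
      false
    } else if (dep.numPartitions > MaxShuffleOutputPartitionsForSerializedMode) {
      println(s"Can't use serialized shuffle for shuffle ${dep.shuffleId} " +
        s"because it has more than $MaxShuffleOutputPartitionsForSerializedMode partitions")
      false
    } else {
      println(s"Can use serialized shuffle for shuffle ${dep.shuffleId}")
      true
    }
}

// Example: one partition over the supported maximum disqualifies the serialized path.
// SerializedShuffleCheck.canUseSerializedShuffle(
//   ShuffleDependencyInfo(0, serializerSupportsRelocation = true,
//     mapSideCombine = false, numPartitions = (1 << 24) + 1)) // false
```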
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#maximum-number-of-partition-identifiers-for-serialized-mode","title":"Maximum Number of Partition Identifiers for Serialized Mode

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleManager defines MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE internal constant to be (1 << 24) (16777216) for the maximum number of shuffle output partitions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UnsafeShuffleWriter is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SortShuffleManager utility is used to check out SerializedShuffleHandle requirements
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleExchangeExec (Spark SQL) utility is used to needToCopyObjectsBeforeShuffle
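
As a quick sanity check of the arithmetic (a REPL-style snippet, not Spark code):

```scala
// 1 << 24 shifts 1 left by 24 bit positions, i.e. 2^24.
val maxShuffleOutputPartitionsForSerializedMode: Int = 1 << 24
assert(maxShuffleOutputPartitionsForSerializedMode == 16777216)
```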
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#creating-shuffleblockresolver","title":"Creating ShuffleBlockResolver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            shuffleBlockResolver: IndexShuffleBlockResolver\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            shuffleBlockResolver\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            shuffleBlockResolver is a IndexShuffleBlockResolver (and is created immediately alongside this SortShuffleManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#unregistering-shuffle","title":"Unregistering Shuffle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            unregisterShuffle(\n  shuffleId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            unregisterShuffle\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            unregisterShuffle removes the given shuffleId from the taskIdMapsForShuffle internal registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If the shuffleId was found and removed successfully, unregisterShuffle requests the IndexShuffleBlockResolver to remove the shuffle index and data files for every mapTaskId (mappers producing the output for the shuffle).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            unregisterShuffle is always true.
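
A minimal sketch of this flow, assuming a hypothetical IndexResolverLike trait in place of IndexShuffleBlockResolver and a plain concurrent map for the taskIdMapsForShuffle registry:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the part of IndexShuffleBlockResolver used here.
trait IndexResolverLike {
  def removeDataByMap(shuffleId: Int, mapTaskId: Long): Unit
}

class ShuffleRegistrySketch(resolver: IndexResolverLike) {
  // shuffleId -> IDs of the map tasks that produced output for that shuffle
  private val taskIdMapsForShuffle = TrieMap.empty[Int, Set[Long]]

  def unregisterShuffle(shuffleId: Int): Boolean = {
    // Remove the shuffle from the registry and, if it was registered,
    // ask the resolver to delete the index and data files of every map task.
    taskIdMapsForShuffle.remove(shuffleId).foreach { mapTaskIds =>
      mapTaskIds.foreach(mapTaskId => resolver.removeDataByMap(shuffleId, mapTaskId))
    }
    true // unregisterShuffle always reports success
  }
}
```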

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#getting-shufflereader-for-shufflehandle","title":"Getting ShuffleReader for ShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getReader[K, C](\n  handle: ShuffleHandle,\n  startMapIndex: Int,\n  endMapIndex: Int,\n  startPartition: Int,\n  endPartition: Int,\n  context: TaskContext,\n  metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getReader\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getReader requests the MapOutputTracker (via SparkEnv) for the getMapSizesByExecutorId for the shuffleId (of the given ShuffleHandle).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, getReader creates a new BlockStoreShuffleReader.
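
A simplified sketch of that flow, with hypothetical stand-in types for illustration only (the real types are MapOutputTracker and BlockStoreShuffleReader):

```scala
object GetReaderSketch {
  // Hypothetical stand-in: returns block locations and sizes per executor
  // for the requested ranges of map outputs and reduce partitions.
  trait MapOutputTrackerLike {
    def getMapSizesByExecutorId(
        shuffleId: Int,
        startMapIndex: Int,
        endMapIndex: Int,
        startPartition: Int,
        endPartition: Int): Iterator[(String, Seq[(String, Long)])]
  }

  // Hypothetical stand-in for BlockStoreShuffleReader.
  final class BlockStoreShuffleReaderSketch[K, C](
      blocksByAddress: Iterator[(String, Seq[(String, Long)])]) {
    def read(): Iterator[(K, C)] = Iterator.empty // would fetch and deserialize the blocks
  }

  def getReader[K, C](
      tracker: MapOutputTrackerLike,
      shuffleId: Int,
      startMapIndex: Int,
      endMapIndex: Int,
      startPartition: Int,
      endPartition: Int): BlockStoreShuffleReaderSketch[K, C] = {
    // 1. Ask the tracker where the shuffle blocks for the requested ranges live.
    val blocksByAddress = tracker.getMapSizesByExecutorId(
      shuffleId, startMapIndex, endMapIndex, startPartition, endPartition)
    // 2. Hand those locations to a block-store-backed reader.
    new BlockStoreShuffleReaderSketch[K, C](blocksByAddress)
  }
}
```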

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#stopping-shufflemanager","title":"Stopping ShuffleManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop\u00a0is part of the ShuffleManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            stop requests the IndexShuffleBlockResolver to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.shuffle.sort.SortShuffleManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            log4j.logger.org.apache.spark.shuffle.sort.SortShuffleManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"shuffle/SortShuffleWriter/","title":"SortShuffleWriter \u2014 Fallback ShuffleWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleWriter is a \"fallback\" ShuffleWriter (when SortShuffleManager is requested for a ShuffleWriter and the more specialized BypassMergeSortShuffleWriter and UnsafeShuffleWriter could not be used).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleWriter[K, V, C] is a parameterized type with K keys, V values, and C combiner values.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"shuffle/SortShuffleWriter/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SortShuffleWriter takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • IndexShuffleBlockResolver (unused)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BaseShuffleHandle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Map ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleExecutorComponents

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SortShuffleWriter is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SortShuffleManager is requested for a ShuffleWriter (for a given ShuffleHandle)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/SortShuffleWriter/#mapstatus","title":"MapStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SortShuffleWriter uses mapStatus internal registry for a MapStatus after writing records.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Writing records itself does not return a value and SortShuffleWriter uses the registry when requested to stop (which allows returning a MapStatus).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/SortShuffleWriter/#writing-records-into-shuffle-partitioned-file-in-disk-store","title":"Writing Records (Into Shuffle Partitioned File In Disk Store)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              write(\n  records: Iterator[Product2[K, V]]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              write is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              write creates an ExternalSorter based on the ShuffleDependency (of the BaseShuffleHandle), namely the Map-Size Partial Aggregation flag. The ExternalSorter uses the aggregator and key ordering when the flag is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              write requests the ExternalSorter to insert all the given records.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              write...FIXME
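
A hedged sketch of the sorter selection described above, with hypothetical simplified types in place of ExternalSorter and ShuffleDependency (the real sorter is also given the partitioner and serializer, among other things):

```scala
// Hypothetical minimal interface of the sorter used by write.
trait SorterLike[K, V] {
  def insertAll(records: Iterator[Product2[K, V]]): Unit
}

object WriteSketch {
  // Pick the sorter configuration based on the map-side combine flag:
  // with map-side combine on, the sorter gets the aggregator and key ordering;
  // with it off, records are only partitioned (no aggregation, no ordering).
  def buildSorter[K, V](
      mapSideCombine: Boolean,
      newAggregatingSorter: () => SorterLike[K, V],
      newPlainSorter: () => SorterLike[K, V]): SorterLike[K, V] =
    if (mapSideCombine) newAggregatingSorter() else newPlainSorter()

  def write[K, V](records: Iterator[Product2[K, V]], sorter: SorterLike[K, V]): Unit =
    // Insert all records; spilling to disk and producing the partitioned
    // map output file happen inside the (real) sorter afterwards.
    sorter.insertAll(records)
}
```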

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/SortShuffleWriter/#stopping-sortshufflewriter-and-calculating-mapstatus","title":"Stopping SortShuffleWriter (and Calculating MapStatus)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop(\n  success: Boolean): Option[MapStatus]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              stop turns the stopping flag on and returns the internal mapStatus if the input success is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Otherwise, when stopping flag is already enabled or the input success is disabled, stop returns no MapStatus (i.e. None).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, stop requests the ExternalSorter to stop and increments the shuffle write time task metrics.
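
A sketch of this control flow with hypothetical simplified types (a String stands in for MapStatus, and the metric update is shown only as a comment):

```scala
// Hypothetical stand-in for the part of ExternalSorter used here.
trait ExternalSorterLike {
  def stop(): Unit
}

final class SortShuffleStopSketch(sorter: ExternalSorterLike) {
  private var stopping = false
  private var mapStatus: Option[String] = None // stand-in for MapStatus, set by write()

  def stop(success: Boolean): Option[String] =
    try {
      if (stopping) {
        None // already stopped
      } else {
        stopping = true
        if (success) mapStatus else None
      }
    } finally {
      // Stopping the sorter is timed; in the real implementation the elapsed
      // time is added to the shuffle write time task metric.
      val start = System.nanoTime()
      sorter.stop()
      val elapsed = System.nanoTime() - start
      println(s"sorter stopped in $elapsed ns")
    }
}
```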

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/SortShuffleWriter/#requirements-of-bypassmergesortshufflehandle-as-shufflehandle","title":"Requirements of BypassMergeSortShuffleHandle (as ShuffleHandle)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              shouldBypassMergeSort(\n  conf: SparkConf,\n  dep: ShuffleDependency[_, _, _]): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              shouldBypassMergeSort returns true when all of the following hold:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. No map-side aggregation (the mapSideCombine flag of the given ShuffleDependency is off)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2. Number of partitions (of the Partitioner of the given ShuffleDependency) is not greater than spark.shuffle.sort.bypassMergeThreshold configuration property

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Otherwise, shouldBypassMergeSort does not hold (false).

shouldBypassMergeSort is used when:

• SortShuffleManager is requested to register a shuffle (and creates a ShuffleHandle)

## stopping Flag

SortShuffleWriter uses a stopping internal flag to indicate whether or not this SortShuffleWriter has been stopped.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/SortShuffleWriter/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.shuffle.sort.SortShuffleWriter logger to see what happens inside.

Add the following line to conf/log4j.properties:

```
log4j.logger.org.apache.spark.shuffle.sort.SortShuffleWriter=ALL
```

Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/Spillable/","title":"Spillable","text":"

Spillable is an extension of the MemoryConsumer abstraction for spillable collections that can spill to disk.

Spillable[C] is parameterized by the type C of the combiner (partial) values.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/Spillable/#contract","title":"Contract","text":""},{"location":"shuffle/Spillable/#forcespill","title":"forceSpill
```scala
forceSpill(): Boolean
```

Force spilling the current in-memory collection to disk to release memory.

Used when Spillable is requested to spill.

### spill
```scala
spill(
  collection: C): Unit
```

Spills the current in-memory collection to disk, and releases the memory.

Used when:

• ExternalAppendOnlyMap is requested to forceSpill
• Spillable is requested to spill to disk if necessary
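Taken together, the contract boils down to two methods. The sketch below is a minimal paraphrase of this contract, not the full Spark class (the real Spillable also extends MemoryConsumer and adds the spilling bookkeeping):

```scala
// Minimal sketch of the Spillable contract described above (not the Spark source).
abstract class SpillableSketch[C] {
  // Spills the given in-memory collection to disk and releases its memory.
  protected def spill(collection: C): Unit

  // Forcefully spills to disk to release memory. The Boolean result is assumed here
  // to report whether anything was actually spilled (the text only gives the signature).
  protected def forceSpill(): Boolean
}
```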
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/Spillable/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalAppendOnlyMap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalSorter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"shuffle/Spillable/#memory-threshold","title":"Memory Threshold

Spillable uses a threshold for the memory size (in bytes) to know when to spill to disk.

When the size of the in-memory collection is above the threshold, Spillable will try to acquire more memory. Unless given all requested memory, Spillable spills to disk.

The memory threshold starts at the spark.shuffle.spill.initialMemoryThreshold configuration property and is increased every time Spillable is requested to spill to disk if needed but manages to acquire the required memory. The threshold goes back to the initial value when Spillable is requested to release all memory.

Used when Spillable is requested to spill and releaseMemory.
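The following self-contained toy model illustrates the threshold behaviour described above. The object name, the grantMemory helper and the doubling request are illustrative stand-ins for the MemoryConsumer/MemoryManager interaction, not the actual Spillable code; 5m is the documented default of spark.shuffle.spill.initialMemoryThreshold.

```scala
// Toy model of the spill threshold (illustrative only, not the Spillable implementation).
object SpillThresholdSketch {
  // documented default of spark.shuffle.spill.initialMemoryThreshold (5m)
  val initialMemoryThreshold: Long = 5L * 1024 * 1024

  var memoryThreshold: Long = initialMemoryThreshold

  // Pretend the memory manager can grant at most `available` bytes.
  private def grantMemory(requested: Long, available: Long): Long =
    math.min(requested, available)

  // Returns true when the in-memory collection (of `currentMemory` bytes) should be spilled.
  def maybeSpill(currentMemory: Long, available: Long): Boolean = {
    if (currentMemory < memoryThreshold) {
      false
    } else {
      // try to acquire enough memory to keep the collection in memory
      val requested = 2 * currentMemory - memoryThreshold
      memoryThreshold += grantMemory(requested, available)
      currentMemory >= memoryThreshold // not granted enough memory => spill
    }
  }

  // The threshold goes back to the initial value when all memory is released.
  def releaseMemory(): Unit = {
    memoryThreshold = initialMemoryThreshold
  }
}
```

With available = 0, the first call at or above the threshold returns true (spill), matching the "unless given all requested memory" rule above; with plenty of available memory the threshold simply grows and nothing is spilled.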

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"shuffle/Spillable/#creating-instance","title":"Creating Instance

Spillable takes the following to be created:

• TaskMemoryManager

Abstract Class

Spillable is an abstract class and cannot be created directly. It is created indirectly for the concrete Spillables.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/Spillable/#configuration-properties","title":"Configuration Properties","text":""},{"location":"shuffle/Spillable/#sparkshufflespillnumelementsforcespillthreshold","title":"spark.shuffle.spill.numElementsForceSpillThreshold

Spillable uses the spark.shuffle.spill.numElementsForceSpillThreshold configuration property to force spilling in-memory objects to disk when requested to maybeSpill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/Spillable/#sparkshufflespillinitialmemorythreshold","title":"spark.shuffle.spill.initialMemoryThreshold

Spillable uses the spark.shuffle.spill.initialMemoryThreshold configuration property as the initial threshold for the size of a collection (and the minimum memory required to operate properly).

Spillable uses it when requested to spill and releaseMemory.
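Both properties can be overridden on a SparkConf before the SparkContext is created. The values below are purely illustrative:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; the keys are the two properties documented above.
val conf = new SparkConf()
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "10000000")          // elements
  .set("spark.shuffle.spill.initialMemoryThreshold", (8L * 1024 * 1024).toString) // bytes
```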

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"shuffle/Spillable/#releasing-all-memory","title":"Releasing All Memory
```scala
releaseMemory(): Unit
```

releaseMemory frees the memory acquired beyond the initial memory threshold and resets the memory threshold back to the initial value (spark.shuffle.spill.initialMemoryThreshold).

releaseMemory is used when:

• ExternalAppendOnlyMap is requested to freeCurrentMap
• ExternalSorter is requested to stop
• Spillable is requested to maybeSpill and spill (and spilled to disk in either case)

## Spilling In-Memory Collection to Disk (to Release Memory)
```scala
spill(
  collection: C): Unit
```

spill spills the given in-memory collection to disk to release memory.

spill is used when:

• ExternalAppendOnlyMap is requested to forceSpill
• Spillable is requested to maybeSpill

## forceSpill
```scala
forceSpill(): Boolean
```

forceSpill forcefully spills the Spillable to disk to release memory.

forceSpill is used when Spillable is requested to spill an in-memory collection to disk.

## Spilling to Disk if Necessary
```scala
maybeSpill(
  collection: C,
  currentMemory: Long): Boolean
```

maybeSpill checks whether the given in-memory collection (of currentMemory size in bytes) should be spilled to disk. When currentMemory is at or above the memory threshold, maybeSpill tries to acquire more memory and, unless given all the requested memory, spills the collection. Spilling is also forced when the number of elements read exceeds spark.shuffle.spill.numElementsForceSpillThreshold. When spilled, maybeSpill releases the memory and returns true.

maybeSpill is used when:

• ExternalAppendOnlyMap is requested to insertAll
• ExternalSorter is requested to attempt to spill an in-memory collection to disk if needed

# UnsafeShuffleWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                UnsafeShuffleWriter<K, V> is a ShuffleWriter for SerializedShuffleHandles.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                UnsafeShuffleWriter opens resources (a ShuffleExternalSorter and the buffers) while being created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"shuffle/UnsafeShuffleWriter/#creating-instance","title":"Creating Instance","text":"

UnsafeShuffleWriter takes the following to be created:

• BlockManager
• TaskMemoryManager
• SerializedShuffleHandle
• Map ID
• TaskContext
• SparkConf
• ShuffleWriteMetricsReporter
• ShuffleExecutorComponents

UnsafeShuffleWriter is created when SortShuffleManager is requested for a ShuffleWriter for a SerializedShuffleHandle.

UnsafeShuffleWriter makes sure that the number of reduce partitions stays within the upper bound of the partition identifiers that can be encoded (1 << 24), or throws an IllegalArgumentException:

UnsafeShuffleWriter can only be used for shuffles with at most 16777215 reduce partitions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses the number of partitions of the Partitioner that is used for the ShuffleDependency of the SerializedShuffleHandle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

The number of shuffle output partitions is first enforced when SortShuffleManager is requested to check whether a SerializedShuffleHandle can be used for a ShuffleHandle (which eventually leads to UnsafeShuffleWriter).
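The check can be sketched as follows. This is a minimal illustration of the guard described above, not Spark's actual source; the method and parameter names are assumptions.

```java
// A minimal sketch (assumed names) of the reduce-partition guard described above.
// `maxReducePartitions` stands for the upper bound of partition identifiers
// that can be encoded by the writer.
static void requireSupportedNumPartitions(int numPartitions, int maxReducePartitions) {
  if (numPartitions > maxReducePartitions) {
    throw new IllegalArgumentException(
        "UnsafeShuffleWriter can only be used for shuffles with at most "
            + maxReducePartitions + " reduce partitions");
  }
}
```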

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, UnsafeShuffleWriter creates a ShuffleExternalSorter and a SerializationStream.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"shuffle/UnsafeShuffleWriter/#shuffleexternalsorter","title":"ShuffleExternalSorter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses a ShuffleExternalSorter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ShuffleExternalSorter is created when UnsafeShuffleWriter is requested to open (while being created) and dereferenced (nulled) when requested to close internal resources and merge spill files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when UnsafeShuffleWriter is requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Updating peak memory used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Writing records
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Closing internal resources and merging spill files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Inserting a record
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stopping
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#indexshuffleblockresolver","title":"IndexShuffleBlockResolver

UnsafeShuffleWriter is given an IndexShuffleBlockResolver when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses the IndexShuffleBlockResolver for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#initial-serialized-buffer-size","title":"Initial Serialized Buffer Size

UnsafeShuffleWriter uses an initial buffer size of 1024 * 1024 bytes (1 MB) for the output stream of serialized data written into a byte array.
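As an illustration only (java.io.ByteArrayOutputStream stands in for the writer's internal growable buffer type):

```java
import java.io.ByteArrayOutputStream;

// Serialized records are first written into an in-memory byte buffer
// with a 1 MB initial capacity (1024 * 1024 bytes).
ByteArrayOutputStream serBuffer = new ByteArrayOutputStream(1024 * 1024);
```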

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#inputbuffersizeinbytes","title":"inputBufferSizeInBytes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses the spark.shuffle.file.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#outputbuffersizeinbytes","title":"outputBufferSizeInBytes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses the spark.shuffle.unsafe.file.output.buffer configuration property for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#transfertoenabled","title":"transferToEnabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter can use a specialized NIO-based fast merge procedure that avoids extra serialization/deserialization when spark.file.transferTo configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#initialsortbuffersize","title":"initialSortBufferSize

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UnsafeShuffleWriter uses the initial buffer size for sorting (default: 4096) when creating a ShuffleExternalSorter (when requested to open).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Use spark.shuffle.sort.initialBufferSize configuration property to change the buffer size.
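For example, a sketch of setting the buffer-related properties mentioned in this and the preceding sections on a SparkConf (the values are purely illustrative, not recommendations):

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.shuffle.sort.initialBufferSize", "8192")     // initial in-memory sort buffer
    .set("spark.shuffle.file.buffer", "64k")                  // inputBufferSizeInBytes
    .set("spark.shuffle.unsafe.file.output.buffer", "64k")    // outputBufferSizeInBytes
    .set("spark.file.transferTo", "true");                    // transferToEnabled
```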

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#merging-spills","title":"Merging Spills
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  long[] mergeSpills(\n  SpillInfo[] spills,\n  File outputFile)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#many-spills","title":"Many Spills

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  With multiple SpillInfos to merge, mergeSpills selects between fast and slow merge strategies. The fast merge strategy can be transferTo- or fileStream-based.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  mergeSpills uses the spark.shuffle.unsafe.fastMergeEnabled configuration property to consider one of the fast merge strategies.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A fast merge strategy is supported when spark.shuffle.compress configuration property is disabled or the IO compression codec supports decompression of concatenated compressed streams.

Conversely, when spark.shuffle.compress is enabled and the IO compression codec does not support concatenation of compressed streams, mergeSpills always falls back to the slow merge strategy.

With the fast merge strategy enabled and supported, transferToEnabled enabled and encryption disabled, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithTransferTo:

Using transferTo-based fast merge

With the fast merge strategy enabled and supported, and transferToEnabled disabled or encryption enabled, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithFileStream (with no compression codec):

Using fileStream-based fast merge

With the slow merge strategy, mergeSpills prints out the following DEBUG message to the logs and uses mergeSpillsWithFileStream (with the compression codec):

Using slow merge

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, mergeSpills requests the ShuffleWriteMetrics to decBytesWritten and incBytesWritten, and returns the partition length array.
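The selection logic above can be summarized with the following sketch. The parameter names are stand-ins for state UnsafeShuffleWriter keeps internally; this is not Spark's actual source.

```java
// A minimal sketch of the merge-strategy selection described above.
static String chooseMergeStrategy(
    boolean fastMergeEnabled,            // spark.shuffle.unsafe.fastMergeEnabled
    boolean compressionEnabled,          // spark.shuffle.compress
    boolean codecSupportsConcatenation,  // codec can decompress concatenated streams
    boolean transferToEnabled,           // spark.file.transferTo
    boolean encryptionEnabled) {
  boolean fastMergeSupported = !compressionEnabled || codecSupportsConcatenation;
  if (fastMergeEnabled && fastMergeSupported) {
    return (transferToEnabled && !encryptionEnabled)
        ? "transferTo-based fast merge"   // mergeSpillsWithTransferTo
        : "fileStream-based fast merge";  // mergeSpillsWithFileStream (no codec)
  }
  return "slow merge";                    // mergeSpillsWithFileStream (with codec)
}
```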

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#one-spill","title":"One Spill

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  With one SpillInfo to merge, mergeSpills simply renames the spill file to be the output file and returns the partition length array of the one spill.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#no-spills","title":"No Spills

With no SpillInfos to merge, mergeSpills creates an empty output file and returns an array of zeros with as many elements as the numPartitions of the Partitioner.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#usage","title":"Usage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  mergeSpills is used when UnsafeShuffleWriter is requested to close internal resources and merge spill files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#mergespillswithtransferto","title":"mergeSpillsWithTransferTo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  long[] mergeSpillsWithTransferTo(\n  SpillInfo[] spills,\n  File outputFile)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  mergeSpillsWithTransferTo...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  mergeSpillsWithTransferTo is used when UnsafeShuffleWriter is requested to mergeSpills (with the transferToEnabled flag enabled and no encryption).
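Pending the FIXME above, the core idea can be illustrated with plain NIO: every spill file is appended to the output file via FileChannel.transferTo, which avoids user-space copies and extra (de)serialization. This is an illustration of the technique, not the method's actual implementation.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Append one spill file to the output file using NIO transferTo (illustration only).
static void appendWithTransferTo(File spill, File output) throws IOException {
  try (FileChannel in = new FileInputStream(spill).getChannel();
       FileChannel out = new FileOutputStream(output, true).getChannel()) {
    long position = 0;
    long size = in.size();
    while (position < size) {
      position += in.transferTo(position, size - position, out);
    }
  }
}
```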

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[updatePeakMemoryUsed]] updatePeakMemoryUsed Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#source-java","title":"[source, java]","text":""},{"location":"shuffle/UnsafeShuffleWriter/#void-updatepeakmemoryused","title":"void updatePeakMemoryUsed()

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  updatePeakMemoryUsed...FIXME

updatePeakMemoryUsed is used when UnsafeShuffleWriter is requested for the peak memory used and to close internal resources and merge spill files.

== [[writing-key-value-records-of-partition]] Writing Key-Value Records of Partition

void write(
  Iterator<Product2<K, V>> records)

write traverses the input records (of an RDD partition) and inserts each into the sorter (insertRecordIntoSorter). When all the records have been processed, write closes internal resources and merges spill files.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, write requests ShuffleExternalSorter to clean up.
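A minimal sketch of write's control flow, with the two internal steps passed in as stand-ins (the helper names insertRecordIntoSorter and closeAndWriteOutput are assumptions based on the description above):

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Insert every record into the sorter, then close internal resources
// and merge spill files.
static <T> void writeAll(
    Iterator<T> records,
    Consumer<T> insertRecordIntoSorter,  // stands for inserting a record into the sorter
    Runnable closeAndWriteOutput) {      // stands for "close internal resources and merge spill files"
  while (records.hasNext()) {
    insertRecordIntoSorter.accept(records.next());
  }
  closeAndWriteOutput.run();
}
```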

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CAUTION: FIXME

When requested to write, UnsafeShuffleWriter simply inserts every record into the ShuffleExternalSorter followed by closing internal resources and merging spill files (that, among other things, creates the MapStatus).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  write is part of the ShuffleWriter abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[stop]] Stopping ShuffleWriter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#source-java_1","title":"[source, java]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Option stop( boolean success)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop...FIXME

When requested to stop, UnsafeShuffleWriter records the peak execution memory metric and returns the MapStatus (that was created when requested to close and write output).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop is part of the ShuffleWriter abstraction.
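A minimal sketch of that contract, with hypothetical field names and a plain Object standing in for MapStatus (not Spark's implementation):

[source, java]
----
import java.util.Optional;

// Hypothetical sketch of stop: record the peak execution memory metric,
// then hand back the MapStatus only if the write succeeded.
final class StopSketch {

  private Object mapStatus;                 // produced earlier by closeAndWriteOutput
  private long peakMemoryUsedBytes;         // tracked while writing
  private long peakExecutionMemoryMetric;   // stand-in for the task's metric

  Optional<Object> stop(boolean success) {
    // record the peak execution memory metric
    peakExecutionMemoryMetric += peakMemoryUsedBytes;
    // return the MapStatus created by closeAndWriteOutput (empty on failure)
    return success ? Optional.ofNullable(mapStatus) : Optional.empty();
  }
}
----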

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[insertRecordIntoSorter]] Inserting Record Into ShuffleExternalSorter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#source-java_2","title":"[source, java]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  void insertRecordIntoSorter( Product2 record)

insertRecordIntoSorter requires that the ShuffleExternalSorter is available.

insertRecordIntoSorter requests the ByteArrayOutputStream (for serialized data) to reset (so that all currently accumulated output in the output stream is discarded and the already allocated buffer space is reused).

insertRecordIntoSorter requests the SerializationStream to write out the record (write the serializer:SerializationStream.md#writeKey[key] and the serializer:SerializationStream.md#writeValue[value]) and to serializer:SerializationStream.md#flush[flush].

[[insertRecordIntoSorter-serializedRecordSize]] insertRecordIntoSorter requests the ByteArrayOutputStream (of serialized data) for the length of the buffer.

[[insertRecordIntoSorter-partitionId]] insertRecordIntoSorter requests the Partitioner for the ../rdd/Partitioner.md#getPartition[partition] for the given record (by the key).

In the end, insertRecordIntoSorter requests the ShuffleExternalSorter to ShuffleExternalSorter.md#insertRecord[insert] the serialized record as a byte array (with the serialized record size and the partition ID).
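Put together, the steps above amount to: reset a reusable in-memory buffer, serialize the key and value into it, and hand the bytes plus the target partition to the sorter. A self-contained sketch with plain-JDK serialization and hypothetical Sorter and SimplePartitioner interfaces (assumed names, not Spark's API):

[source, java]
----
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical sketch of insertRecordIntoSorter: reset the reusable buffer,
// serialize key and value into it, then insert the bytes with the partition ID.
final class InsertSketch<K, V> {

  interface Sorter {                                  // stand-in for ShuffleExternalSorter
    void insertRecord(byte[] bytes, int length, int partitionId);
  }

  interface SimplePartitioner<K> {                    // stand-in for Partitioner
    int getPartition(K key);
  }

  private final ByteArrayOutputStream serBuffer = new ByteArrayOutputStream(1024 * 1024);
  private final Sorter sorter;
  private final SimplePartitioner<K> partitioner;

  InsertSketch(Sorter sorter, SimplePartitioner<K> partitioner) {
    this.sorter = sorter;
    this.partitioner = partitioner;
  }

  void insertRecordIntoSorter(K key, V value) throws IOException {
    serBuffer.reset();                                // discard previously buffered output
    try (ObjectOutputStream out = new ObjectOutputStream(serBuffer)) {
      out.writeObject(key);                           // write the key
      out.writeObject(value);                         // write the value
      out.flush();                                    // flush the stream
    }
    int serializedRecordSize = serBuffer.size();      // length of the buffer
    int partitionId = partitioner.getPartition(key);  // partition for the key
    sorter.insertRecord(serBuffer.toByteArray(), serializedRecordSize, partitionId);
  }
}
----

Unlike this sketch, the real writer creates its serialization stream once (in open) and reuses it across records; the per-record stream here only keeps the example self-contained.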

insertRecordIntoSorter is used when UnsafeShuffleWriter is requested to write records.

== [[closeAndWriteOutput]] Closing and Writing Output (Merging Spill Files)

[source, java]
----
void closeAndWriteOutput()
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  closeAndWriteOutput asserts that the ShuffleExternalSorter is created (non-null).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  closeAndWriteOutput updates peak memory used.

closeAndWriteOutput removes the references to the ByteArrayOutputStream (for serialized data) and the SerializationStream output streams (nulls them).

closeAndWriteOutput requests the ShuffleExternalSorter to ShuffleExternalSorter.md#closeAndGetSpills[close and return spill metadata].

closeAndWriteOutput removes the reference to the ShuffleExternalSorter (nulls it).

closeAndWriteOutput requests the IndexShuffleBlockResolver for the IndexShuffleBlockResolver.md#getDataFile[output data file] for the shuffle and map IDs.

[[closeAndWriteOutput-partitionLengths]][[closeAndWriteOutput-tmp]] closeAndWriteOutput creates a temporary file (alongside the data output file) and uses it to merge the spill files (that gives a partition length array). All spill files are then deleted.

closeAndWriteOutput requests the IndexShuffleBlockResolver to IndexShuffleBlockResolver.md#writeIndexFileAndCommit[write shuffle index and data files] (for the shuffle and map IDs, the partition lengths and the temporary file).

In the end, closeAndWriteOutput creates a scheduler:MapStatus.md[MapStatus] with the storage:BlockManager.md#shuffleServerId[location of the local BlockManager] and the partition lengths.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  closeAndWriteOutput prints out the following ERROR message to the logs if there is an issue with deleting spill files:

[source,plaintext]
----
Error while deleting spill file [path]
----

closeAndWriteOutput prints out the following ERROR message to the logs if there is an issue with deleting the temporary file:

[source,plaintext]
----
Error while deleting temp file [path]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  closeAndWriteOutput is used when UnsafeShuffleWriter is requested to write records.
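The merge-and-commit protocol described above can be sketched as follows. The Sorter and BlockResolver interfaces and the mergeSpills helper are hypothetical stand-ins (assumed names), and the real writer logs the deletion errors rather than printing them:

[source, java]
----
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of closeAndWriteOutput: close the sorter to get its spills,
// merge them into a temporary file, commit data + index, return partition lengths.
final class CloseAndWriteOutputSketch {

  interface Sorter {                        // stand-in for ShuffleExternalSorter
    File[] closeAndGetSpills();
  }

  interface BlockResolver {                 // stand-in for IndexShuffleBlockResolver
    File getDataFile(int shuffleId, long mapId);
    void writeIndexFileAndCommit(int shuffleId, long mapId, long[] lengths, File tmp);
  }

  private final Sorter sorter;
  private final BlockResolver resolver;
  private final int shuffleId;
  private final long mapId;

  CloseAndWriteOutputSketch(Sorter sorter, BlockResolver resolver, int shuffleId, long mapId) {
    this.sorter = sorter;
    this.resolver = resolver;
    this.shuffleId = shuffleId;
    this.mapId = mapId;
  }

  long[] closeAndWriteOutput() throws IOException {
    File[] spills = sorter.closeAndGetSpills();              // spill metadata
    File output = resolver.getDataFile(shuffleId, mapId);    // final data file
    File tmp = new File(output.getPath() + ".tmp");          // temporary file alongside it
    try {
      long[] partitionLengths = mergeSpills(spills, tmp);    // merge all spills into tmp
      resolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
      return partitionLengths;                               // wrapped in a MapStatus by the caller
    } finally {
      // delete every spill file (the real writer logs an ERROR on failure)
      for (File spill : spills) {
        if (!spill.delete()) {
          System.err.println("Error while deleting spill file " + spill.getPath());
        }
      }
      // delete the temporary file if the commit left it behind
      if (tmp.exists() && !tmp.delete()) {
        System.err.println("Error while deleting temp file " + tmp.getPath());
      }
    }
  }

  private long[] mergeSpills(File[] spills, File target) {
    // Placeholder: concatenate the per-partition regions of every spill into target
    // and return the resulting partition lengths.
    return new long[0];
  }
}
----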

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[getPeakMemoryUsedBytes]] Getting Peak Memory Used

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#source-java_3","title":"[source, java]","text":""},{"location":"shuffle/UnsafeShuffleWriter/#long-getpeakmemoryusedbytes","title":"long getPeakMemoryUsedBytes()

getPeakMemoryUsedBytes simply updates the peak memory used and returns the internal peak memory used registry.

getPeakMemoryUsedBytes is used when UnsafeShuffleWriter is requested to stop.
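A minimal sketch of that bookkeeping (assumed field and interface names): the writer keeps the maximum peak reported by the sorter so the value is still available after the sorter is released.

[source, java]
----
// Hypothetical sketch: keep the maximum of the sorter-reported peaks,
// so the value can still be read after the sorter is nulled out.
final class PeakMemorySketch {

  interface Sorter {                     // stand-in for ShuffleExternalSorter
    long getPeakMemoryUsedBytes();
  }

  private Sorter sorter;                 // null once the sorter is released
  private long peakMemoryUsedBytes = 0L; // internal registry

  private void updatePeakMemoryUsed() {
    if (sorter != null) {
      peakMemoryUsedBytes = Math.max(peakMemoryUsedBytes, sorter.getPeakMemoryUsedBytes());
    }
  }

  long getPeakMemoryUsedBytes() {
    updatePeakMemoryUsed();              // refresh from the sorter (if still available)
    return peakMemoryUsedBytes;          // return the internal registry
  }
}
----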

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[open]] Opening UnsafeShuffleWriter and Buffers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#source-java_4","title":"[source, java]","text":""},{"location":"shuffle/UnsafeShuffleWriter/#void-open","title":"void open()

open requires that there is no ShuffleExternalSorter available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  open creates a ShuffleExternalSorter.md[ShuffleExternalSorter].

open creates a ByteArrayOutputStream (for serialized data) with the capacity of the initial serialization buffer size.

open requests the SerializerInstance for a serializer:SerializerInstance.md#serializeStream[SerializationStream] to the ByteArrayOutputStream (available internally as the serOutputStream reference).

open is used when UnsafeShuffleWriter is created.
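Conceptually, open wires together the sorter, the reusable serialization buffer, and a serialization stream writing into that buffer. A hedged sketch with plain-JDK stand-ins (ObjectOutputStream in place of Spark's SerializationStream; the 1 MB capacity is an assumption):

[source, java]
----
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical sketch of open: create the sorter, the serialized-data buffer,
// and a serialization stream writing into that buffer.
final class OpenSketch {

  interface Sorter {}                                    // stand-in for ShuffleExternalSorter

  private Sorter sorter;
  private ByteArrayOutputStream serBuffer;
  private ObjectOutputStream serOutputStream;            // stand-in for SerializationStream

  void open() throws IOException {
    assert sorter == null;                               // no sorter may be available yet
    sorter = new Sorter() {};                            // ShuffleExternalSorter in the real writer
    serBuffer = new ByteArrayOutputStream(1024 * 1024);  // assumed 1 MB initial capacity
    serOutputStream = new ObjectOutputStream(serBuffer); // serializeStream(serBuffer) in Spark
  }
}
----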

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.shuffle.sort.UnsafeShuffleWriter logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"shuffle/UnsafeShuffleWriter/#log4jloggerorgapachesparkshufflesortunsafeshufflewriterall","title":"log4j.logger.org.apache.spark.shuffle.sort.UnsafeShuffleWriter=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"shuffle/UnsafeShuffleWriter/#internal-properties","title":"Internal Properties","text":""},{"location":"shuffle/UnsafeShuffleWriter/#mapstatus","title":"MapStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  MapStatus

Created when UnsafeShuffleWriter is requested to close and write output (with the storage:BlockManagerId.md[] of the local BlockManager and the partitionLengths)

Returned when UnsafeShuffleWriter is requested to stop

=== [[partitioner]] Partitioner

Partitioner (as used by the BaseShuffleHandle.md#dependency[ShuffleDependency] of the SerializedShuffleHandle)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Used when UnsafeShuffleWriter is requested for the following:

• open (and create a ShuffleExternalSorter.md[ShuffleExternalSorter] with the given ../rdd/Partitioner.md#numPartitions[number of partitions])

• insertRecordIntoSorter (and request the ../rdd/Partitioner.md#getPartition[partition for the key])

• mergeSpills and the other spill-merging methods (for the ../rdd/Partitioner.md#numPartitions[number of partitions] to create partition lengths)

=== [[peakMemoryUsedBytes]] Peak Memory Used

Peak memory used (in bytes) that is updated exclusively in updatePeakMemoryUsed (after requesting the ShuffleExternalSorter for ShuffleExternalSorter.md#getPeakMemoryUsedBytes[getPeakMemoryUsedBytes])

Use getPeakMemoryUsedBytes to access the current value

=== [[serBuffer]] ByteArrayOutputStream for Serialized Data

{java-javadoc-url}/java/io/ByteArrayOutputStream.html[java.io.ByteArrayOutputStream] of serialized data (written into a byte array of the initial serialization buffer size)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used when UnsafeShuffleWriter is requested for the following:

• open (and create the internal SerializationStream)

• insertRecordIntoSorter

Destroyed (null) when requested to close and write output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      === [[serializer]] serializer

serializer:SerializerInstance.md[SerializerInstance] (that is a new instance of the Serializer of the BaseShuffleHandle.md#dependency[ShuffleDependency] of the SerializedShuffleHandle)

Used exclusively when UnsafeShuffleWriter is requested to open (and creates the SerializationStream)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      === [[serOutputStream]] serOutputStream

serializer:SerializationStream.md[SerializationStream] (that is created when the SerializerInstance is requested to serializer:SerializerInstance.md#serializeStream[serializeStream] with the ByteArrayOutputStream for serialized data)

Used when UnsafeShuffleWriter is requested to insertRecordIntoSorter

Destroyed (null) when requested to close and write output.

=== [[shuffleId]] Shuffle ID

Shuffle ID (of the ShuffleDependency of the SerializedShuffleHandle)

Used exclusively when requested to <>

=== [[writeMetrics]] writeMetrics

executor:ShuffleWriteMetrics.md[] (of the TaskMetrics of the <>)

Used when UnsafeShuffleWriter is requested for the following:

* <> (and creates the <>)
* <>
* <>
* <>

# Stage-Level Scheduling

Stage-Level Scheduling uses ResourceProfiles for the following:

* Spark developers can specify task and executor resource requirements at stage level
* Spark (Scheduler) uses the stage-level requirements to acquire the necessary resources and executors and schedule tasks based on the per-stage requirements

Apache Spark 3.1.1

Stage-Level Scheduling was introduced in Apache Spark 3.1.1 (cf. SPARK-27495).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#resource-profiles","title":"Resource Profiles","text":"

Resource Profiles are managed by ResourceProfileManager.

The Default ResourceProfile is known by ID 0.

Custom Resource Profiles are ResourceProfiles with non-zero IDs. Custom Resource Profiles are only supported on YARN, Kubernetes and Spark Standalone.

ResourceProfiles are associated with an RDD using the withResources operator.
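As a minimal sketch (not from the original text), a custom ResourceProfile can be built with ResourceProfileBuilder and attached to an RDD with withResources. The requested amounts are arbitrary, and `sc` is assumed to be an active SparkContext on a cluster manager that supports custom profiles.

```scala
// Sketch: build a custom ResourceProfile and attach it to an RDD.
// Amounts are illustrative; `sc` is an assumed active SparkContext.
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

val execReqs = new ExecutorResourceRequests().cores(4).memory("2g")
val taskReqs = new TaskResourceRequests().cpus(2)

val rp = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// Stages computed from this RDD are scheduled with the custom profile.
val rddWithProfile = sc.range(0, 100).map(_ * 2).withResources(rp)
```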

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#resource-requests","title":"Resource Requests","text":""},{"location":"stage-level-scheduling/#executor","title":"Executor","text":"

Executor Resource Requests are specified using executorResources of a ResourceProfile.

Executor Resource Requests can be the following built-in resources:

* cores
* memory
* memoryOverhead
* pyspark.memory
* offHeap

Other (deployment environment-specific) executor resource requests can be defined as Custom Executor Resources.
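For illustration (a sketch, not from the original text), built-in and custom executor resources can be requested together through ExecutorResourceRequests; the "gpu" resource name and the discovery script path are hypothetical.

```scala
// Sketch: built-in executor resources plus a hypothetical custom "gpu" resource.
import org.apache.spark.resource.ExecutorResourceRequests

val execReqs = new ExecutorResourceRequests()
  .cores(8)             // built-in: cores
  .memory("4g")         // built-in: memory
  .memoryOverhead("1g") // built-in: memoryOverhead
  .resource("gpu", 1, "/opt/spark/getGpus.sh") // custom resource; name and script path are assumptions
```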

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#task","title":"Task","text":"

Default Task Resources are specified based on spark.task.cpus and spark.task.resource-prefixed configuration properties.
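A hedged sketch of overriding those defaults for a custom profile with TaskResourceRequests follows; the "gpu" resource name and the amounts are assumptions for the example.

```scala
// Sketch: per-profile task requirements that override the defaults
// coming from spark.task.cpus and spark.task.resource.* properties.
import org.apache.spark.resource.TaskResourceRequests

val taskReqs = new TaskResourceRequests()
  .cpus(2)              // instead of the spark.task.cpus default
  .resource("gpu", 0.5) // fractional amount: two tasks can share one "gpu" address
```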

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#sparklistenerresourceprofileadded","title":"SparkListenerResourceProfileAdded","text":"

ResourceProfiles can be monitored using SparkListenerResourceProfileAdded.
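For example (a sketch, not from the original text), a SparkListener can react to SparkListenerResourceProfileAdded events; the listener class name is made up, and `sc` is an assumed active SparkContext.

```scala
// Sketch: log ResourceProfile registrations via the listener bus.
import org.apache.spark.scheduler.{SparkListener, SparkListenerResourceProfileAdded}

class ResourceProfileLogger extends SparkListener {
  override def onResourceProfileAdded(event: SparkListenerResourceProfileAdded): Unit = {
    println(s"ResourceProfile added with id ${event.resourceProfile.id}")
  }
}

// Register with an active SparkContext (e.g. sc in spark-shell).
sc.addSparkListener(new ResourceProfileLogger)
```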

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#dynamic-allocation","title":"Dynamic Allocation","text":"

Dynamic Allocation of Executors is not supported.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#demo","title":"Demo","text":""},{"location":"stage-level-scheduling/#describe-distributed-computation","title":"Describe Distributed Computation","text":"

Let's describe a distributed computation (using the RDD API) over a 10-record dataset.

```scala
val rdd = sc.range(0, 10)
```

### Describe Required Resources

Optional Step

This demo is assumed to be executed in the local deployment mode (which supports the default ResourceProfile only), so this step is considered optional until a supported cluster manager is used.

```scala
import org.apache.spark.resource.ResourceProfileBuilder
val rpb = new ResourceProfileBuilder
val rp1 = rpb.build()
```

```text
scala> println(rp1.toString)
Profile: id = 1, executor resources: , task resources:
```

### Configure Default ResourceProfile

FIXME

Use spark.task.resource-prefixed properties per ResourceUtils.
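Pending the FIXME above, one hedged way this could look is shaping the default ResourceProfile at launch time with spark.executor.resource.* and spark.task.resource.* properties; the "gpu" resource name, amounts, and discovery script path below are assumptions, not from the original text.

```console
# Sketch only: resource name (gpu), amounts, and script path are illustrative
$ ./bin/spark-shell \
    -c spark.executor.resource.gpu.amount=1 \
    -c spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh \
    -c spark.task.resource.gpu.amount=1
```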

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/#associate-required-resources-to-distributed-computation","title":"Associate Required Resources to Distributed Computation","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        rdd.withResources(rp1)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        scala> rdd.withResources(rp1)\norg.apache.spark.SparkException: TaskResourceProfiles are only supported for Standalone cluster for now when dynamic allocation is disabled.\n  at org.apache.spark.resource.ResourceProfileManager.isSupported(ResourceProfileManager.scala:71)\n  at org.apache.spark.resource.ResourceProfileManager.addResourceProfile(ResourceProfileManager.scala:126)\n  at org.apache.spark.rdd.RDD.withResources(RDD.scala:1802)\n  ... 42 elided\n
SPARK-43912

Reported as SPARK-43912 Incorrect SparkException for Stage-Level Scheduling in local mode.

Until it is fixed, enable Dynamic Allocation.

```console
$ ./bin/spark-shell -c spark.dynamicAllocation.enabled=true
```

# ExecutorResourceInfo

ExecutorResourceInfo is a ResourceAllocator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"stage-level-scheduling/ExecutorResourceInfo/#creating-instance","title":"Creating Instance","text":"

ExecutorResourceInfo takes the following to be created:

* Resource Name
* Addresses
* Number of slots (per address)

ExecutorResourceInfo is created when:

* DriverEndpoint is requested to handle a RegisterExecutor event

# ExecutorResourceRequest

## Creating Instance

ExecutorResourceRequest takes the following to be created:

* Resource Name
* Amount
* Discovery Script
* Vendor

ExecutorResourceRequest is created when:

* ExecutorResourceRequests is requested to memory, offHeapMemory, memoryOverhead, pysparkMemory, cores and resource
* JsonProtocol utility is used to executorResourceRequestFromJson

## Serializable

ExecutorResourceRequest is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ExecutorResourceRequests/","title":"ExecutorResourceRequests","text":"

ExecutorResourceRequests is a collection of ExecutorResourceRequests (one ExecutorResourceRequest per resource) for Spark developers to (programmatically) specify the executor resources of an RDD, to be applied at stage level (see the sketch right after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • cores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • memoryOverhead
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • offHeap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • pyspark.memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • custom resource
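
As a sketch of how these requests are applied at stage level (assuming Spark 3.1+ with ResourceProfileBuilder, TaskResourceRequests and RDD.withResources, and an existing rdd; this example is not part of the original page):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}\n// executor resources for the stages that compute this RDD\nval executorResources = new ExecutorResourceRequests()\n  .cores(4)\n  .memory(\"4g\")\n// per-task resources\nval taskResources = new TaskResourceRequests().cpus(2)\n// rdd is assumed to be an existing RDD; the ResourceProfile applies to the stages computing it\nval profiledRDD = rdd.withResources(\n  new ResourceProfileBuilder()\n    .require(executorResources)\n    .require(taskResources)\n    .build())\n

Note that the cluster manager has to support stage-level scheduling for such a ResourceProfile to be honoured.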
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorResourceRequests takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorResourceRequests is created when:

• ResourceProfile utility is used to get the default executor resource requests
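
For intuition only, a hedged sketch (simplified, not ResourceProfile's actual code) of how default executor resource requests could be assembled from the executor configuration properties:

// simplified illustration only, not Spark's actual ResourceProfile code\nimport org.apache.spark.SparkConf\nimport org.apache.spark.resource.ExecutorResourceRequests\nval conf = new SparkConf()\nval defaultExecutorRequests = new ExecutorResourceRequests()\n  .cores(conf.getInt(\"spark.executor.cores\", 1))\n  .memory(conf.get(\"spark.executor.memory\", \"1g\"))\n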
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#serializable","title":"Serializable","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorResourceRequests is a Serializable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ExecutorResourceRequests/#resource","title":"resource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resource(\n  resourceName: String,\n  amount: Long,\n  discoveryScript: String = \"\",\n  vendor: String = \"\"): this.type\n

resource creates an ExecutorResourceRequest and registers it under resourceName.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resource is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResourceProfile utility is used for the default executor resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ExecutorResourceRequests/#text-representation","title":"Text Representation

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ExecutorResourceRequests presents itself as:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Executor resource requests: [_executorResources]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ExecutorResourceRequests/#demo","title":"Demo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.resource.ExecutorResourceRequests\nval executorResources = new ExecutorResourceRequests()\n  .memory(\"2g\")\n  .memoryOverhead(\"512m\")\n  .cores(8)\n  .resource(\n    resourceName = \"my-custom-resource\",\n    amount = 1,\n    discoveryScript = \"/this/is/path/to/discovery/script.sh\",\n    vendor = \"pl.japila\")\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scala> println(executorResources)\nExecutor resource requests: {memoryOverhead=name: memoryOverhead, amount: 512, script: , vendor: , memory=name: memory, amount: 2048, script: , vendor: , cores=name: cores, amount: 8, script: , vendor: , my-custom-resource=name: my-custom-resource, amount: 1, script: /this/is/path/to/discovery/script.sh, vendor: pl.japila}\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/","title":"ResourceAllocator","text":"

ResourceAllocator is an abstraction of resource allocators that track which addresses of a resource are available and assigned, with a fixed number of slots per address (see the sketch after the Implementations list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ResourceAllocator/#contract","title":"Contract","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#resourceaddresses","title":"resourceAddresses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resourceAddresses: Seq[String]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResourceAllocator is requested for the addressAvailabilityMap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#resourcename","title":"resourceName
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            resourceName: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResourceAllocator is requested to acquire and release addresses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#slotsperaddress","title":"slotsPerAddress
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            slotsPerAddress: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResourceAllocator is requested for the addressAvailabilityMap, assignedAddrs and to release
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExecutorResourceInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerResourceInfo (Spark Standalone)
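
The following is a simplified, standalone sketch of the allocation model (illustration only, not Spark's ResourceAllocator, which is private[spark]); ToyGpuAllocator and its addresses are hypothetical:

import scala.collection.mutable\n// every resource address starts with slotsPerAddress free slots;\n// acquire takes one slot per given address, release gives one back\nclass ToyGpuAllocator { // hypothetical class name\n  val resourceName = \"gpu\"\n  val resourceAddresses = Seq(\"0\", \"1\")\n  val slotsPerAddress = 2\n  private val slots = mutable.HashMap(resourceAddresses.map(_ -> slotsPerAddress): _*)\n  def availableAddrs: Seq[String] = slots.filter(_._2 > 0).keys.toSeq.sorted\n  def assignedAddrs: Seq[String] = slots.filter(_._2 < slotsPerAddress).keys.toSeq.sorted\n  def acquire(addrs: Seq[String]): Unit = addrs.foreach { a =>\n    require(slots.getOrElse(a, 0) > 0, s\"Address $a is not available\")\n    slots(a) -= 1\n  }\n  def release(addrs: Seq[String]): Unit = addrs.foreach { a =>\n    require(slots.getOrElse(a, slotsPerAddress) < slotsPerAddress, s\"Address $a is not assigned\")\n    slots(a) += 1\n  }\n}\n

val gpus = new ToyGpuAllocator()\ngpus.acquire(Seq(\"0\", \"0\")) // takes both slots of address 0\nassert(gpus.availableAddrs == Seq(\"1\"))\nassert(gpus.assignedAddrs == Seq(\"0\"))\ngpus.release(Seq(\"0\"))\nassert(gpus.availableAddrs == Seq(\"0\", \"1\"))\n

Spark's ResourceAllocator follows the same slot-counting model (see acquire and release below).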
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ResourceAllocator/#acquiring-addresses","title":"Acquiring Addresses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            acquire(\n  addrs: Seq[String]): Unit\n

acquire marks the given addresses as taken, decreasing the number of available slots of every given address by one (in the addressAvailabilityMap).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            acquire is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DriverEndpoint is requested to launchTasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerResourceInfo (Spark Standalone) is requested to acquire and recoverResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#releasing-addresses","title":"Releasing Addresses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            release(\n  addrs: Seq[String]): Unit\n

release gives the given addresses back, increasing the number of available slots of every given address by one (in the addressAvailabilityMap).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            release is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DriverEndpoint is requested to handle a StatusUpdate event
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerInfo (Spark Standalone) is requested to releaseResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#assignedaddrs","title":"assignedAddrs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            assignedAddrs: Seq[String]\n

assignedAddrs are the addresses with at least one slot taken (assigned).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            assignedAddrs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerInfo (Spark Standalone) is requested for the resourcesInfoUsed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#availableaddrs","title":"availableAddrs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            availableAddrs: Seq[String]\n

availableAddrs are the addresses with at least one slot still available.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            availableAddrs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerInfo (Spark Standalone) is requested for the resourcesInfoFree
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • WorkerResourceInfo (Spark Standalone) is requested to acquire and resourcesAmountFree
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DriverEndpoint is requested to makeOffers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceAllocator/#addressavailabilitymap","title":"addressAvailabilityMap
addressAvailabilityMap: mutable.HashMap[String, Int]\n

addressAvailabilityMap is a registry of the number of available slots per resource address (every resourceAddresses entry starts with slotsPerAddress slots).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Lazy Value

addressAvailabilityMap is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time). The reference never changes afterwards, although the slot counters inside the map are updated by acquire and release.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Learn more in the Scala Language Specification.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            addressAvailabilityMap is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ResourceAllocator is requested to availableAddrs, assignedAddrs, acquire, release
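For intuition, here is a minimal, self-contained sketch (not the actual ResourceAllocator code; resourceAddresses and slotsPerAddress stand in for the allocator's inputs) of how a lazily initialized availability map can back availableAddrs, acquire and release:

import scala.collection.mutable

object AvailabilitySketch {
  // hypothetical inputs standing in for the allocator's resourceAddresses and slotsPerAddress
  val resourceAddresses: Seq[String] = Seq("0", "1")
  val slotsPerAddress: Int = 1

  // initialized once, on first access; the map itself is then mutated in place
  lazy val addressAvailabilityMap: mutable.HashMap[String, Int] =
    mutable.HashMap(resourceAddresses.map(_ -> slotsPerAddress): _*)

  // addresses with at least one slot left
  def availableAddrs: Seq[String] =
    addressAvailabilityMap.filter(_._2 > 0).keys.toSeq.sorted

  // acquiring an address takes one slot away
  def acquire(address: String): Unit = {
    val left = addressAvailabilityMap.getOrElse(address, 0)
    require(left > 0, s"Address $address has no slots left")
    addressAvailabilityMap(address) = left - 1
  }

  // releasing an address gives one slot back
  def release(address: String): Unit =
    addressAvailabilityMap(address) = addressAvailabilityMap.getOrElse(address, 0) + 1
}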
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"stage-level-scheduling/ResourceID/","title":"ResourceID","text":"

ResourceID is an identifier of a resource, made up of a component name (e.g. spark.executor, spark.task) and a resource name (e.g. gpu). It is used to build the names of resource-related configuration properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/ResourceProfile/","title":"ResourceProfile","text":"

ResourceProfile describes the executor and task resource requirements of an RDD in Stage-Level Scheduling.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ResourceProfile can be associated with an RDD using RDD.withResources method.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The ResourceProfile of an RDD is available using RDD.getResourceProfile method.
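A short usage sketch, assuming an existing SparkContext sc, dynamic allocation enabled, and a cluster configured with a gpu executor resource (the amounts below are illustrative):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// build a profile that asks for GPU-equipped executors and one GPU per task
val execReqs = new ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val rp = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build

// associate the profile with an RDD and read it back
val rdd = sc.range(0, 100).withResources(rp)
assert(rdd.getResourceProfile() == rp)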

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"stage-level-scheduling/ResourceProfile/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ResourceProfile takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Executor Resources (Map[String, ExecutorResourceRequest])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Task Resources (Map[String, TaskResourceRequest])

ResourceProfile is created (directly or using getOrCreateDefaultProfile) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DriverEndpoint is requested to handle a RetrieveSparkAppConfig message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfileBuilder utility is requested to build
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#allSupportedExecutorResources","title":"Built-In Executor Resources","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfile defines the following names as the Supported Executor Resources (among the specified executorResources):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • cores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • memoryOverhead
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • pyspark.memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • offHeap

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              All other executor resources (names) are considered Custom Executor Resources.
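For illustration only (this is not the actual getCustomExecutorResources code), separating the custom entries amounts to filtering the executor resources map by the built-in names listed above:

import org.apache.spark.resource.ExecutorResourceRequest

// built-in (supported) executor resource names; everything else is custom
val builtIn = Set("cores", "memory", "memoryOverhead", "pyspark.memory", "offHeap")

def customResources(
    executorResources: Map[String, ExecutorResourceRequest]): Map[String, ExecutorResourceRequest] =
  executorResources.filterNot { case (name, _) => builtIn(name) }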

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#getCustomExecutorResources","title":"Custom Executor Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getCustomExecutorResources(): Map[String, ExecutorResourceRequest]\n

getCustomExecutorResources returns the Executor Resources that are not among the supported (built-in) executor resources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getCustomExecutorResources is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ApplicationDescription is requested to resourceReqsPerExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ApplicationInfo is requested to createResourceDescForResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfile is requested to calculateTasksAndLimitingResource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceUtils is requested to getOrDiscoverAllResourcesForResourceProfile, warnOnWastedResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#limitingResource","title":"Limiting Resource","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              limitingResource(\n  sparkConf: SparkConf): String\n

limitingResource returns the _limitingResource, if already calculated, or calculateTasksAndLimitingResource otherwise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              limitingResource is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfileManager is requested to add a new ResourceProfile (to recompute a limiting resource eagerly)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceUtils is requested to warnOnWastedResources (for reporting purposes only)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#_limitingResource","title":"_limitingResource","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              _limitingResource: Option[String] = None\n

ResourceProfile defines the _limitingResource internal variable that is determined (if there is one) in calculateTasksAndLimitingResource (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              _limitingResource can be the following:

• A \"special\" empty resource identifier (that is assumed to be cpus in TaskSchedulerImpl)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • cpus built-in task resource identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • any custom resource identifier
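As a rough sketch of the idea behind calculateTasksAndLimitingResource (illustrative code, not Spark's): for every resource, the executor amount divided by the task amount gives the number of concurrent tasks that resource allows per executor, and the resource allowing the fewest tasks is the limiting one.

// Amounts is an illustrative stand-in for a resource's executor- and task-level amounts
final case class Amounts(executorAmount: Double, taskAmount: Double)

def limiting(amounts: Map[String, Amounts]): Option[String] =
  if (amounts.isEmpty) None
  else {
    // how many concurrent tasks each resource allows per executor
    val tasksPerExecutor = amounts.map { case (name, a) =>
      name -> (a.executorAmount / a.taskAmount).toInt
    }
    // the resource allowing the fewest tasks limits parallelism
    Some(tasksPerExecutor.minBy(_._2)._1)
  }

// e.g. 8 cores at 1 cpu per task allow 8 tasks, but 2 GPUs at 1 per task allow only 2,
// so limiting(Map("cpus" -> Amounts(8, 1), "gpu" -> Amounts(2, 1))) yields Some("gpu")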
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#defaultProfile","title":"Default Profile","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfile (Scala object) defines defaultProfile internal registry for the default ResourceProfile (per JVM instance).

defaultProfile is initially undefined (None) and is assigned a new ResourceProfile when first requested.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              defaultProfile can be accessed using getOrCreateDefaultProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              defaultProfile is cleared (removed) in clearDefaultProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#getOrCreateDefaultProfile","title":"getOrCreateDefaultProfile","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getOrCreateDefaultProfile(\n  conf: SparkConf): ResourceProfile\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getOrCreateDefaultProfile returns the default profile (if already defined) or creates a new one.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Unless defined, getOrCreateDefaultProfile creates a ResourceProfile with the default task and executor resource descriptions and makes it the defaultProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getOrCreateDefaultProfile prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Default ResourceProfile created,\nexecutor resources: [executorResources], task resources: [taskResources]\n
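The pattern is essentially a lazily created, per-JVM singleton guarded by a lock. A self-contained sketch under that assumption (DefaultHolder and Profile are illustrative stand-ins, not Spark classes):

import org.apache.spark.SparkConf

object DefaultHolder {
  // illustrative stand-in for the default ResourceProfile
  final case class Profile(executorResources: Map[String, String], taskResources: Map[String, String])

  private var default: Option[Profile] = None

  // create the default once per JVM and reuse it afterwards
  def getOrCreateDefault(conf: SparkConf): Profile = synchronized {
    default.getOrElse {
      val created = Profile(
        Map("cores" -> conf.get("spark.executor.cores", "1")),
        Map("cpus" -> conf.get("spark.task.cpus", "1")))
      default = Some(created)
      created
    }
  }

  // forget the cached default (mirrors clearing the registry)
  def clearDefault(): Unit = synchronized { default = None }
}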

getOrCreateDefaultProfile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskResourceProfile is requested to getCustomExecutorResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfile is requested to getDefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfileManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • YarnAllocator (Spark on YARN) is requested to initDefaultProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultExecutorResources","title":"Default Executor Resource Requests","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultExecutorResources(\n  conf: SparkConf): Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultExecutorResources creates an ExecutorResourceRequests with the following:

• cores: spark.executor.cores
• memory: spark.executor.memory
• memoryOverhead: spark.executor.memoryOverhead
• pysparkMemory: spark.executor.pyspark.memory
• offHeapMemory: spark.memory.offHeap.size

getDefaultExecutorResources finds the executor resource requests specified in the given SparkConf (under the spark.executor component name) and adds them to the ExecutorResourceRequests.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultExecutorResources initializes the defaultProfileExecutorResources (with the executor resource requests).

In the end, getDefaultExecutorResources requests the ExecutorResourceRequests for all the resource requests.
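A hedged sketch of what the defaults boil down to, using the public ExecutorResourceRequests API (the fallback values 1 and 1g below are illustrative assumptions, not necessarily Spark's defaults):

import org.apache.spark.SparkConf
import org.apache.spark.resource.ExecutorResourceRequests

val conf = new SparkConf()

// cores and memory come from spark.executor.* configuration properties
val ereqs = new ExecutorResourceRequests()
  .cores(conf.get("spark.executor.cores", "1").toInt)
  .memory(conf.get("spark.executor.memory", "1g"))

// the resulting executor resource requests, keyed by resource name
val executorResources = ereqs.requests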

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultTaskResources","title":"Default Task Resource Requests","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultTaskResources(\n  conf: SparkConf): Map[String, TaskResourceRequest]\n

getDefaultTaskResources creates a new TaskResourceRequests with the cpus based on the spark.task.cpus configuration property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultTaskResources adds task resource requests (configured in the given SparkConf using spark.task.resource-prefixed properties).

In the end, getDefaultTaskResources requests the TaskResourceRequests for the (task) resource requests.
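
A matching sketch for the task side (again, defaultTaskResourcesSketch and the default of 1 cpu are made up for illustration; only the SparkConf, TaskResourceRequests and requests calls are the real public APIs):

import org.apache.spark.SparkConf\nimport org.apache.spark.resource.{TaskResourceRequest, TaskResourceRequests}\n\n// Sketch only: derive default task resource requests from a SparkConf\ndef defaultTaskResourcesSketch(\n    conf: SparkConf): Map[String, TaskResourceRequest] = {\n  val treqs = new TaskResourceRequests()\n  treqs.cpus(conf.getInt(\"spark.task.cpus\", 1))\n  // custom resources are declared with spark.task.resource.<name>.amount properties\n  conf.getAll\n    .collect { case (k, v) if k.startsWith(\"spark.task.resource.\") && k.endsWith(\".amount\") =>\n      (k.stripPrefix(\"spark.task.resource.\").stripSuffix(\".amount\"), v.toDouble) }\n    .foreach { case (name, amount) => treqs.resource(name, amount) }\n  treqs.requests\n}\n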

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfile/#getresourcesforclustermanager","title":"getResourcesForClusterManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourcesForClusterManager(\n  rpId: Int,\n  execResources: Map[String, ExecutorResourceRequest],\n  overheadFactor: Double,\n  conf: SparkConf,\n  isPythonApp: Boolean,\n  resourceMappings: Map[String, String]): ExecutorResourcesOrDefaults\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourcesForClusterManager takes the DefaultProfileExecutorResources.

getResourcesForClusterManager calculates the overhead memory with the following (a sketch follows the list below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • memoryOverheadMiB and executorMemoryMiB of the DefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Given overheadFactor
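
A hedged sketch of that calculation (the 384 MiB floor is an assumption based on Spark's documented default minimum overhead; the actual code may differ):

// Sketch only: an explicit memoryOverheadMiB wins; otherwise the overhead is\n// derived from the executor memory and the given overheadFactor (with a minimum floor)\ndef overheadMemoryMiBSketch(\n    memoryOverheadMiB: Option[Long],\n    executorMemoryMiB: Long,\n    overheadFactor: Double): Long = {\n  val minOverheadMiB = 384L // assumption: Spark's documented minimum overhead\n  memoryOverheadMiB.getOrElse(\n    math.max((executorMemoryMiB * overheadFactor).toLong, minOverheadMiB))\n}\n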

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              If the given rpId resource profile ID is not the default ID (0), getResourcesForClusterManager...FIXME (there is so much to \"digest\")

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourcesForClusterManager...FIXME

In the end, getResourcesForClusterManager creates an ExecutorResourcesOrDefaults.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getResourcesForClusterManager is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BasicExecutorFeatureStep (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • YarnAllocator (Spark on YARN) is requested to createYarnResourceForResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#getDefaultProfileExecutorResources","title":"getDefaultProfileExecutorResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultProfileExecutorResources(\n  conf: SparkConf): DefaultProfileExecutorResources\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultProfileExecutorResources...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getDefaultProfileExecutorResources is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfile is requested to getResourcesForClusterManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • YarnAllocator (Spark on YARN) is requested to runAllocatedContainers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#serializable","title":"Serializable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfile is a Java Serializable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"stage-level-scheduling/ResourceProfile/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Enable ALL logging level for org.apache.spark.resource.ResourceProfile logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              logger.ResourceProfile.name = org.apache.spark.resource.ResourceProfile\nlogger.ResourceProfile.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"stage-level-scheduling/ResourceProfileBuilder/","title":"ResourceProfileBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfileBuilder is a fluent API for Spark developers to build ResourceProfiles (to associate with an RDD).


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfileBuilder is available in Scala and Python APIs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfileBuilder takes no arguments to be created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#build","title":"Building ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              build: ResourceProfile\n

build creates a ResourceProfile (see the example right after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TaskResourceProfile when _executorResources are undefined
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfile with the executorResources and the taskResources
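
A hedged illustration of the two outcomes (inspecting the runtime class is for demonstration only; TaskResourceProfile is a Spark-internal subclass of ResourceProfile):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}\n\n// Only task-level requests => build gives a TaskResourceProfile\nval taskOnly = new ResourceProfileBuilder()\n  .require(new TaskResourceRequests().cpus(2))\n  .build\nprintln(taskOnly.getClass.getSimpleName)\n\n// Executor- and task-level requests => build gives a regular ResourceProfile\nval full = new ResourceProfileBuilder()\n  .require(new ExecutorResourceRequests().cores(4))\n  .require(new TaskResourceRequests().cpus(2))\n  .build\nprintln(full.getClass.getSimpleName)\n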
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#executorResources","title":"Executor Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              executorResources: Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              executorResources...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileBuilder/#taskResources","title":"Task Resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              taskResources: Map[String, TaskResourceRequest]\n

taskResources are the TaskResourceRequests specified by users (keyed by resource name).

taskResources are specified using the require method.

taskResources can be removed using the clearTaskResourceRequests method.

taskResources can be printed out using the toString method.
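
A short, illustrative usage of the methods above (the resource name and amounts are arbitrary):

import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}\n\nval builder = new ResourceProfileBuilder()\nbuilder.require(new TaskResourceRequests().cpus(2).resource(\"gpu\", 0.5))\nprintln(builder.taskResources) // registered TaskResourceRequests keyed by resource name\nbuilder.clearTaskResourceRequests()\nprintln(builder.taskResources) // empty again\n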

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              taskResources is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ResourceProfileBuilder is requested to build a ResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"stage-level-scheduling/ResourceProfileBuilder/#demo","title":"Demo","text":"
import org.apache.spark.resource.ResourceProfileBuilder\nval rp1 = new ResourceProfileBuilder()\n\nimport org.apache.spark.resource.ExecutorResourceRequests\nval execReqs = new ExecutorResourceRequests().cores(4).resource(\"gpu\", 4)\n\nimport org.apache.spark.resource.TaskResourceRequests\nval taskReqs = new TaskResourceRequests().cpus(1).resource(\"gpu\", 1)\n\nrp1.require(execReqs).require(taskReqs)\nval rprof1 = rp1.build\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              val rpManager = sc.resourceProfileManager // (1)!\nrpManager.addResourceProfile(rprof1)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. NOTE: resourceProfileManager is private[spark]
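
As a follow-up, the public way to use such a profile, without touching the private[spark] resourceProfileManager, is to attach it to an RDD with RDD.withResources (this snippet assumes the sc and rprof1 values from the demo above and a cluster manager that supports stage-level scheduling):

// Stage-level scheduling: run the stages computing this RDD with rprof1's resources\nval rdd = sc.parallelize(1 to 100, numSlices = 4).map(_ * 2).withResources(rprof1)\nrdd.collect()\n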
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileManager/","title":"ResourceProfileManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfileManager manages ResourceProfiles.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"stage-level-scheduling/ResourceProfileManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ResourceProfileManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • LiveListenerBus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ResourceProfileManager is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#accessing-resourceprofilemanager","title":"Accessing ResourceProfileManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ResourceProfileManager is available to other Spark services using SparkContext.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#resourceProfileIdToResourceProfile","title":"Registered ResourceProfiles","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                resourceProfileIdToResourceProfile: HashMap[Int, ResourceProfile]\n

ResourceProfileManager creates the resourceProfileIdToResourceProfile registry of ResourceProfiles by their ID.

A new ResourceProfile is added (registered) using addResourceProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ResourceProfiles are resolved (looked up) using resourceProfileFromId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ResourceProfiles can be equivalent when they specify the same resources.
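
A minimal sketch of such an ID-keyed registry (ProfileRegistrySketch is made up for illustration and is not Spark's actual implementation):

import scala.collection.mutable\nimport org.apache.spark.resource.ResourceProfile\n\n// Sketch only: ResourceProfiles registered and looked up by their ID\nclass ProfileRegistrySketch {\n  private val byId = new mutable.HashMap[Int, ResourceProfile]\n  def add(rp: ResourceProfile): Unit = byId.synchronized { byId(rp.id) = rp }\n  def fromId(id: Int): ResourceProfile = byId.synchronized {\n    byId.getOrElse(id, throw new IllegalArgumentException(s\"ResourceProfile $id not found\"))\n  }\n}\n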

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                resourceProfileIdToResourceProfile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • canBeScheduled
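
The following is a minimal, self-contained sketch of these registry semantics. It is a simplified stand-in, not Spark's ResourceProfileManager; only the ResourceProfile type is Spark's.

import java.util.concurrent.ConcurrentHashMap\nimport org.apache.spark.resource.ResourceProfile\n\n// a simplified stand-in for the registry (not Spark's own class)\nclass SimpleProfileRegistry {\n  private val byId = new ConcurrentHashMap[Int, ResourceProfile]()\n\n  // register unless a profile with the same ID is registered already\n  def addResourceProfile(rp: ResourceProfile): Unit =\n    byId.putIfAbsent(rp.id, rp)\n\n  // resolve a profile by its ID; fail loudly for unknown IDs\n  def resourceProfileFromId(rpId: Int): ResourceProfile = {\n    val rp = byId.get(rpId)\n    require(rp != null, s\"ResourceProfile id $rpId not found\")\n    rp\n  }\n}\n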
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#defaultProfile","title":"Default ResourceProfile","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ResourceProfileManager gets or creates the default ResourceProfile when created and registers it immediately.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The default profile is available as defaultResourceProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#defaultResourceProfile","title":"Accessing Default ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                defaultResourceProfile: ResourceProfile\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                defaultResourceProfile returns the default ResourceProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                defaultResourceProfile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ExecutorAllocationManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is requested to requestTotalExecutors and createTaskScheduler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested to mergeResourceProfilesForStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • CoarseGrainedSchedulerBackend is requested to requestExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • StandaloneSchedulerBackend (Spark Standalone) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • MesosCoarseGrainedSchedulerBackend (Spark on Mesos) is created
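
As the DAGScheduler use case above suggests, RDDs with no explicit profile fall back to the default one. A small illustration with the public RDD API (the null return is documented behavior; the fallback itself happens inside the scheduler):

// RDDs created without RDD.withResources carry no profile of their own;\n// getResourceProfile() returns null and the default profile applies at scheduling time\nval rdd = sc.parallelize(1 to 10)\nassert(rdd.getResourceProfile() == null)\n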
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#addResourceProfile","title":"Registering ResourceProfile","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addResourceProfile(\n  rp: ResourceProfile): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addResourceProfile checks if the given ResourceProfile is supported.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addResourceProfile registers the given ResourceProfile (in the resourceProfileIdToResourceProfile registry) unless done earlier (by ResourceProfile ID).

With a new ResourceProfile, addResourceProfile requests the given ResourceProfile for its limiting resource (merely to have it calculated and cached upfront) and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Added ResourceProfile id: [id]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In the end (for a new ResourceProfile), addResourceProfile requests the LiveListenerBus to post a SparkListenerResourceProfileAdded.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addResourceProfile is used when:

• RDD.withResources operator is used (see the example below)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceProfileManager is created (and registers the default profile)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DAGScheduler is requested to mergeResourceProfilesForStage
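
For illustration, the most common indirect trigger is RDD.withResources with a profile built using the public ResourceProfileBuilder API. This is a sketch for spark-shell; actually running a job with it requires a cluster manager on which the profile isSupported.

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}\n\n// build a custom ResourceProfile with the public builder API\nval ereqs = new ExecutorResourceRequests().cores(4).memory(\"6g\").resource(\"gpu\", 1)\nval treqs = new TaskResourceRequests().cpus(1).resource(\"gpu\", 1)\nval rp = new ResourceProfileBuilder().require(ereqs).require(treqs).build()\n\n// attaching the profile to an RDD eventually registers it with addResourceProfile\nval rdd = sc.parallelize(1 to 100, 4).withResources(rp)\n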
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#dynamicEnabled","title":"Dynamic Allocation","text":"

ResourceProfileManager initializes the dynamicEnabled flag to the value of isDynamicAllocationEnabled when created (see the sketch below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                dynamicEnabled flag is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • isSupported
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • canBeScheduled
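
A hedged sketch of what the dynamic-allocation check amounts to (the actual helper is internal to Spark; only the standard spark.dynamicAllocation.* settings are assumed):

import org.apache.spark.SparkConf\n\n// dynamic allocation counts as enabled when the flag is on and the master is not local\n// (unless the internal testing flag overrides that); a sketch, not Spark's own helper\ndef dynamicAllocationEnabled(conf: SparkConf): Boolean =\n  conf.getBoolean(\"spark.dynamicAllocation.enabled\", false) &&\n    (!conf.get(\"spark.master\", \"\").startsWith(\"local\") ||\n      conf.getBoolean(\"spark.dynamicAllocation.testing\", false))\n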
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#isSupported","title":"isSupported","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                isSupported(\n  rp: ResourceProfile): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                isSupported...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#canBeScheduled","title":"canBeScheduled","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                canBeScheduled(\n  taskRpId: Int,\n  executorRpId: Int): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                canBeScheduled asserts that the given taskRpId and executorRpId are valid ResourceProfile IDs or throws an AssertionError:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Tasks and executors must have valid resource profile id\n

canBeScheduled looks up the ResourceProfile of the given taskRpId (in the resourceProfileIdToResourceProfile registry).

canBeScheduled is positive (true) when either of the following holds (sketched below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. The given taskRpId and executorRpId are the same
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. Dynamic Allocation is disabled and the ResourceProfile is a TaskResourceProfile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                canBeScheduled is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TaskSchedulerImpl is requested to resourceOfferSingleTaskSet and calculateAvailableSlots
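
A self-contained sketch of the decision, with simplified stand-in profile types (Spark's TaskResourceProfile and the registry are internal, so hypothetical ones are used here):

// stand-in profile types for illustration only\nsealed trait Profile { def id: Int }\ncase class RegularProfile(id: Int) extends Profile\ncase class TaskOnlyProfile(id: Int) extends Profile  // plays the role of TaskResourceProfile\n\ndef canBeScheduled(\n    taskRpId: Int,\n    executorRpId: Int,\n    profiles: Map[Int, Profile],  // stands in for resourceProfileIdToResourceProfile\n    dynamicEnabled: Boolean): Boolean = {\n  assert(taskRpId >= 0 && executorRpId >= 0,\n    \"Tasks and executors must have valid resource profile id\")\n  val taskRp = profiles(taskRpId)\n  taskRpId == executorRpId ||\n    (!dynamicEnabled && taskRp.isInstanceOf[TaskOnlyProfile])\n}\n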
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceProfileManager/#logging","title":"Logging","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.resource.ResourceProfileManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j2.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                logger.ResourceProfileManager.name = org.apache.spark.resource.ResourceProfileManager\nlogger.ResourceProfileManager.level = all\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/","title":"ResourceUtils","text":""},{"location":"stage-level-scheduling/ResourceUtils/#addTaskResourceRequests","title":"Registering Task Resource Requests (from SparkConf)","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addTaskResourceRequests(\n  sparkConf: SparkConf,\n  treqs: TaskResourceRequests): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addTaskResourceRequests registers all task resource requests in the given SparkConf with the given TaskResourceRequests.

addTaskResourceRequests finds all the configured resources (listResourceIds) with the spark.task component name in the given SparkConf.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                For every ResourceID discovered, addTaskResourceRequests does the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. Finds all the settings with the confPrefix
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. Looks up amount setting (or throws a SparkException)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3. Registers the resourceName with the amount in the given TaskResourceRequests
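
For instance, with spark.task.resource.gpu.amount=2 the registration boils down to the following (public TaskResourceRequests API; the SparkConf parsing itself is left out):

import org.apache.spark.resource.TaskResourceRequests\n\n// resourceName \"gpu\" with amount 2, as if parsed from spark.task.resource.gpu.amount=2\nval treqs = new TaskResourceRequests()\ntreqs.resource(\"gpu\", 2)\n// treqs.requests now holds the \"gpu\" task resource request\n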

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                addTaskResourceRequests is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceProfile is requested for the default task resource requests
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/#listResourceIds","title":"Listing All Configured Resources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                listResourceIds(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceID]\n

listResourceIds requests the given SparkConf for all the settings whose keys start with the following prefix:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                [componentName].resource.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Internals

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                listResourceIds gets resource-related settings (from SparkConf) with the prefix removed (e.g., spark.my_component.resource.gpu.amount becomes just gpu.amount).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Example
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                // Use the following to start spark-shell\n// ./bin/spark-shell -c spark.my_component.resource.gpu.amount=5\n\nval sparkConf = sc.getConf\n\n// Component names must start with `spark.` prefix\n// Spark assumes valid Spark settings start with `spark.` prefix\nval componentName = \"spark.my_component\"\n\n// this is copied verbatim from ResourceUtils.listResourceIds\n// Note that `resource` is hardcoded\nsparkConf.getAllWithPrefix(s\"$componentName.resource.\").foreach(println)\n\n// (gpu.amount,5)\n

listResourceIds asserts that every resource setting includes a . (dot) to separate the resource name from its config (e.g., gpu.amount), or throws the following SparkException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                You must specify an amount config for resource: [key] config: [componentName].resource.[key]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                SPARK-43947

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Although the exception says You must specify an amount config for resource, only the dot is checked.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                // Use the following to start spark-shell\n// 1. No amount config specified\n// 2. spark.driver is a Spark built-in resource\n// ./bin/spark-shell -c spark.driver.resource.gpu=5\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Reported as SPARK-43947.

In the end, listResourceIds creates a ResourceID for every resource name discovered (paired with the given componentName).
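
For illustration, a minimal user-level sketch of how the resource names are derived from a SparkConf; it mirrors the getAllWithPrefix call above (ResourceID itself is private[spark], so the sketch only derives the names):

// Continuing the spark-shell session above\nval componentName = \"spark.my_component\"\nval resourceNames = sc.getConf\n  .getAllWithPrefix(s\"$componentName.resource.\")\n  .map { case (key, _) => key.substring(0, key.indexOf('.')) }\n  .toSet\n// resourceNames: Set[String] = Set(gpu)\n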

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                listResourceIds is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceUtils is requested to parseAllResourceRequests, addTaskResourceRequests, parseResourceRequirements, parseAllocatedOrDiscoverResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/#parseAllResourceRequests","title":"parseAllResourceRequests","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseAllResourceRequests(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseAllResourceRequests...FIXME

The componentName depends on the caller: ResourceProfile uses spark.executor, while ResourceUtils and KubernetesUtils (Spark on Kubernetes) pass through the componentName they were given.
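
As a hedged illustration for the spark.executor case (property names follow the documented spark.executor.resource.* pattern): parseAllResourceRequests is expected to turn the gpu settings below into a single ResourceRequest.

// Hypothetical spark-shell session:\n// ./bin/spark-shell -c spark.executor.resource.gpu.amount=2 -c spark.executor.resource.gpu.discoveryScript=/opt/getGpus.sh\n\n// User-level peek at the properties parseAllResourceRequests reads\nsc.getConf.getAllWithPrefix(\"spark.executor.resource.\").foreach(println)\n// (gpu.amount,2)\n// (gpu.discoveryScript,/opt/getGpus.sh)\n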

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseAllResourceRequests is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceProfile is requested for the default executor resource requests
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceUtils is requested to getOrDiscoverAllResources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • KubernetesUtils (Spark on Kubernetes) is requested to buildResourcesQuantities
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/#getOrDiscoverAllResources","title":"getOrDiscoverAllResources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getOrDiscoverAllResources(\n  sparkConf: SparkConf,\n  componentName: String,\n  resourcesFileOpt: Option[String]): Map[String, ResourceInformation]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getOrDiscoverAllResources...FIXME

componentName and resourcesFileOpt depend on the caller:

• SparkContext: componentName is spark.driver and resourcesFileOpt is spark.driver.resourcesFile
• Worker (Spark Standalone): componentName is spark.worker and resourcesFileOpt is spark.worker.resourcesFile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                getOrDiscoverAllResources is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkContext is created (and initializes _resources)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Worker (Spark Standalone) is requested to setupWorkerResources
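
The resources discovered for the spark.driver component end up exposed as sc.resources; a minimal user-level sketch (assumes the driver was started with spark.driver.resource.* settings):

// sc.resources exposes the Map[String, ResourceInformation] that\n// getOrDiscoverAllResources built for the spark.driver component\nsc.resources.foreach { case (name, info) =>\n  println(name + \" -> \" + info.addresses.mkString(\",\"))\n}\n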
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/#parseAllocatedOrDiscoverResources","title":"parseAllocatedOrDiscoverResources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseAllocatedOrDiscoverResources(\n  sparkConf: SparkConf,\n  componentName: String,\n  resourcesFileOpt: Option[String]): Seq[ResourceAllocation]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseAllocatedOrDiscoverResources...FIXME
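
For context, a hedged sketch of the resources file being parsed; the JSON shape follows Spark's documented custom resource scheduling, so treat the exact layout as an assumption:

// Example contents of a resources file (e.g. the file configured via\n// spark.driver.resourcesFile): a JSON array of allocations\n// [\n//   { \"id\": { \"componentName\": \"spark.driver\", \"resourceName\": \"gpu\" },\n//     \"addresses\": [\"0\", \"1\"] }\n// ]\n// parseAllocatedOrDiscoverResources is expected to turn each entry into\n// a ResourceAllocation (and to fall back to discovery otherwise)\n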

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/ResourceUtils/#parseResourceRequirements","title":"parseResourceRequirements (Spark Standalone)","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseResourceRequirements(\n  sparkConf: SparkConf,\n  componentName: String): Seq[ResourceRequirement]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseResourceRequirements...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                componentName

componentName seems to always be spark.driver for these use cases, which appear to be Spark Standalone only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                parseResourceRequirements is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ClientEndpoint (Spark Standalone) is requested to onStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • StandaloneSubmitRequestServlet (Spark Standalone) is requested to buildDriverDescription
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/","title":"SparkListenerResourceProfileAdded","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkListenerResourceProfileAdded is a SparkListenerEvent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkListenerResourceProfileAdded can be intercepted using the following Spark listeners:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkFirehoseListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkListenerInterface
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • SparkListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkListenerResourceProfileAdded is recorded using AppStatusListener for status reporting and monitoring.
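
A minimal sketch of a custom listener reacting to this event (the listener class name is made up; assumes Spark 3.1+, where SparkListener.onResourceProfileAdded is available):

import org.apache.spark.scheduler.{SparkListener, SparkListenerResourceProfileAdded}\n\nclass ResourceProfileLogger extends SparkListener {\n  override def onResourceProfileAdded(\n      event: SparkListenerResourceProfileAdded): Unit = {\n    // resourceProfile is the ResourceProfile carried by the event\n    println(s\"ResourceProfile added: ${event.resourceProfile.id}\")\n  }\n}\n\n// Register on an active SparkContext\nsc.addSparkListener(new ResourceProfileLogger)\n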

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                SparkListenerResourceProfileAdded takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ResourceProfile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkListenerResourceProfileAdded is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ResourceProfileManager is requested to register a new ResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • JsonProtocol (Spark History Server) is requested to resourceProfileAddedFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/SparkListenerResourceProfileAdded/#spark-history-server","title":"Spark History Server","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkListenerResourceProfileAdded is logged in Spark History Server using EventLoggingListener.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  SparkListenerResourceProfileAdded is converted from and to JSON format using JsonProtocol (resourceProfileAddedFromJson and resourceProfileAddedToJson, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","tags":["DeveloperApi"]},{"location":"stage-level-scheduling/TaskResourceProfile/","title":"TaskResourceProfile","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskResourceProfile is a ResourceProfile.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"stage-level-scheduling/TaskResourceProfile/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  TaskResourceProfile takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Task Resources

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskResourceProfile is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ResourceProfileBuilder is requested to build a ResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DAGScheduler is requested to merge ResourceProfiles
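
Of the two paths above, ResourceProfileBuilder is public API; a hedged sketch with task-only requirements (whether build() gives back a TaskResourceProfile depends on the Spark version, so treat the final comment as an expectation):

import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}\n\n// Task-only requirements: no executor resources registered\nval treqs = new TaskResourceRequests().cpus(4)\nval rp = new ResourceProfileBuilder().require(treqs).build()\n// With no executor requirements, recent Spark versions are expected\n// to build a TaskResourceProfile here\n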
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/TaskResourceProfile/#getCustomExecutorResources","title":"getCustomExecutorResources","text":"ResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getCustomExecutorResources(): Map[String, ExecutorResourceRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getCustomExecutorResources is part of the ResourceProfile abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getCustomExecutorResources...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/TaskResourceRequest/","title":"TaskResourceRequest","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskResourceRequest is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/TaskResourceRequests/","title":"TaskResourceRequests","text":"

TaskResourceRequests is a convenience API for building and registering TaskResourceRequests (hence the name \ud83d\ude09).

TaskResourceRequests can be registered as requirements of a ResourceProfile using ResourceProfileBuilder.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    TaskResourceRequests can be specified using configuration properties (using spark.task prefix).
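
A hedged sketch of the programmatic route (the property-based route would use spark.task.cpus and spark.task.resource.*.amount instead):

import org.apache.spark.resource.TaskResourceRequests\n\nval treqs = new TaskResourceRequests()\n  .cpus(2)            // the built-in cpus resource\n  .resource(\"gpu\", 1) // a user-defined resource name\n// treqs.requests now has entries for the cpus and gpu resource names\n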

Resource Name and its Registerer: the cpus resource is registered with cpus; a user-defined name is registered with resource or addRequest"},{"location":"stage-level-scheduling/TaskResourceRequests/#creating-instance","title":"Creating Instance","text":"

TaskResourceRequests takes no arguments to be created.

TaskResourceRequests is created when:

• ResourceProfile is requested for the default task resource requests

## Serializable

TaskResourceRequests is Serializable (Java).

## cpus

```scala
cpus(
  amount: Int): this.type
```

cpus registers a TaskResourceRequest with the given amount (of CPU cores) in the _taskResources registry under the cpus resource name.

Fluent API

cpus is part of the fluent API of TaskResourceRequests (hence the strange-looking this.type return type).
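
To make the registration and the fluent return type concrete, here is a hypothetical, simplified stand-in (not the real org.apache.spark.resource.TaskResourceRequests, just the shape described above):

```scala
import java.util.concurrent.ConcurrentHashMap

// Simplified stand-in: cpus registers an amount under the "cpus" resource name in an
// internal registry and returns this, so calls can be chained (fluent API, this.type).
class SimpleTaskResourceRequests {
  private val _taskResources = new ConcurrentHashMap[String, Double]()

  def cpus(amount: Int): this.type = {
    _taskResources.put("cpus", amount.toDouble)
    this
  }
}
```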

cpus is used when:

• ResourceProfile is requested for the default task resource requests
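
A usage sketch with ResourceProfileBuilder (hedged: the amounts are illustrative and the builder chain follows the stage-level scheduling API as documented for Spark 3.x):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a ResourceProfile that requires 2 CPU cores per task (plus executor-side resources),
// using the fluent TaskResourceRequests API described above.
val taskReqs = new TaskResourceRequests().cpus(2)
val execReqs = new ExecutorResourceRequests().cores(4).memory("2g")

val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build
```

The resulting ResourceProfile can then be attached to an RDD with RDD.withResources (stage-level scheduling).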
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/TaskResourceRequests/#_taskResources","title":"_taskResources","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    _taskResources: ConcurrentHashMap[String, TaskResourceRequest]\n

_taskResources is a collection of TaskResourceRequests by their resource name.

_taskResources is available as requests.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"stage-level-scheduling/TaskResourceRequests/#requests","title":"requests","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    requests: Map[String, TaskResourceRequest]\n

requests returns the _taskResources registry converted to a Scala Map.
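
Presumably the conversion is no more than a Java-to-Scala collection bridge; a small standalone illustration (String values stand in for TaskResourceRequests):

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// _taskResources is a Java ConcurrentHashMap; requests exposes it as a Scala Map.
val _taskResources = new ConcurrentHashMap[String, String]()
_taskResources.put("cpus", "TaskResourceRequest(cpus, 2)")

val requests: Map[String, String] = _taskResources.asScala.toMap
println(requests) // Map(cpus -> TaskResourceRequest(cpus, 2))
```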

requests is used when:

• ResourceProfile is requested for the default task resource requests
• ResourceProfileBuilder is requested to require
• TaskResourceRequests is requested for the string representation

# Status

The Status system uses AppStatusListener to write the state of a Spark application out to AppStatusStore for reporting and monitoring by:

• web UI
• REST API
• Spark History Server
• Metrics
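
As a hedged, user-facing illustration: SparkStatusTracker reads from the same application state that AppStatusListener maintains, so a small local job can be inspected right after it finishes (the null job group simply selects jobs that were not assigned a group):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run a small job, then read its status back through SparkStatusTracker.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("status-demo"))
val tracker = sc.statusTracker

sc.range(0, 1000000).count()

tracker.getJobIdsForGroup(null).foreach { jobId =>
  tracker.getJobInfo(jobId).foreach(info => println(s"Job $jobId: ${info.status}"))
}

sc.stop()
```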
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"status/AppStatusListener/","title":"AppStatusListener","text":"

AppStatusListener is a SparkListener that writes application state information to a data store.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"status/AppStatusListener/#event-handlers","title":"Event Handlers","text":"Event Handler LiveEntities onJobStart
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LiveJob
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • LiveStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RDDOperationGraph onStageSubmitted"},{"location":"status/AppStatusListener/#creating-instance","title":"Creating Instance","text":"

AppStatusListener takes the following to be created:

• ElementTrackingStore
• SparkConf
• live flag
• AppStatusSource (default: None)
• Last Update Time (default: None)

AppStatusListener is created when:

• AppStatusStore is requested for an in-memory store for a running Spark application (with the live flag enabled)
• FsHistoryProvider is requested to rebuildAppStore (with the live flag disabled)

## ElementTrackingStore

AppStatusListener is given an ElementTrackingStore when created.

AppStatusListener registers triggers to clean up state in the store:

• cleanupExecutors
• cleanupJobs
• cleanupStages

ElementTrackingStore is used to write and...FIXME
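
A hedged sketch of how one of these cleanup triggers could be registered inside AppStatusListener (the ExecutorSummaryWrapper class and the spark.ui.retainedDeadExecutors threshold approximate the Spark sources and may differ across versions):

```scala
// Inside AppStatusListener (simplified sketch): once the store holds more than `threshold`
// ExecutorSummaryWrapper entries, the registered action is invoked with the current count.
kvstore.addTrigger(classOf[ExecutorSummaryWrapper], conf.getInt("spark.ui.retainedDeadExecutors", 100)) {
  count => cleanupExecutors(count)
}
```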

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"status/AppStatusListener/#live-flag","title":"live Flag

AppStatusListener is given a live flag when created.

The live flag indicates what the AppStatusListener was created for:

• true when created for an active (live) Spark application (for AppStatusStore)
• false when created for Spark History Server (for FsHistoryProvider)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"status/AppStatusListener/#updating-elementtrackingstore-for-active-spark-application","title":"Updating ElementTrackingStore for Active Spark Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        liveUpdate(\n  entity: LiveEntity,\n  now: Long): Unit\n

liveUpdate updates the ElementTrackingStore only when the live flag is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"status/AppStatusListener/#updating-elementtrackingstore","title":"Updating ElementTrackingStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        update(\n  entity: LiveEntity,\n  now: Long,\n  last: Boolean = false): Unit\n

update requests the given LiveEntity to write itself out (to the ElementTrackingStore, with the checkTriggers flag set to the given last flag).
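
Taken together, a minimal sketch of liveUpdate and update (following the descriptions on this page rather than the exact Spark sources; kvstore stands for the ElementTrackingStore, and LiveEntity.write with a checkTriggers parameter is assumed):

```scala
// liveUpdate is the "live applications only" guard around update; update delegates to
// LiveEntity.write against the ElementTrackingStore (kvstore), with checkTriggers = last.
private def liveUpdate(entity: LiveEntity, now: Long): Unit = {
  if (live) {
    update(entity, now)
  }
}

private def update(entity: LiveEntity, now: Long, last: Boolean = false): Unit = {
  entity.write(kvstore, now, checkTriggers = last)
}
```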

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"status/AppStatusListener/#getorcreateexecutor","title":"getOrCreateExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getOrCreateExecutor(\n  executorId: String,\n  addTime: Long): LiveExecutor\n

getOrCreateExecutor...FIXME

getOrCreateExecutor is used when:

• AppStatusListener is requested to onExecutorAdded and onBlockManagerAdded

## getOrCreateStage
```scala
getOrCreateStage(
  info: StageInfo): LiveStage
```

getOrCreateStage...FIXME

getOrCreateStage is used when:

• AppStatusListener is requested to onJobStart and onStageSubmitted

# AppStatusSource

AppStatusSource is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"status/AppStatusStore/","title":"AppStatusStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        AppStatusStore stores the state of a Spark application in a data store (listening to state changes using AppStatusListener).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"status/AppStatusStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        AppStatusStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • KVStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • AppStatusListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AppStatusStore is created\u00a0using createLiveStore utility.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"status/AppStatusStore/#creating-in-memory-store-for-live-spark-application","title":"Creating In-Memory Store for Live Spark Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createLiveStore(\n  conf: SparkConf,\n  appStatusSource: Option[AppStatusSource] = None): AppStatusStore\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createLiveStore creates an ElementTrackingStore (with InMemoryStore and the SparkConf).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          createLiveStore creates an AppStatusListener (with the ElementTrackingStore, live flag on and the AppStatusSource).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          In the end, creates an AppStatusStore (with the ElementTrackingStore and AppStatusListener).
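
That sequence can be sketched as follows. The classes involved are package-private, so this would compile only inside the org.apache.spark.status package, and the exact constructor parameters are assumptions of this sketch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.util.kvstore.InMemoryStore

// Sketch of the createLiveStore sequence described above.
def createLiveStore(
    conf: SparkConf,
    appStatusSource: Option[AppStatusSource] = None): AppStatusStore = {
  // 1. An in-memory KVStore wrapped in an ElementTrackingStore
  val store = new ElementTrackingStore(new InMemoryStore, conf)
  // 2. An AppStatusListener with the live flag enabled
  val listener = new AppStatusListener(store, conf, true, appStatusSource)
  // 3. The AppStatusStore over the store and the listener
  new AppStatusStore(store, listener = Some(listener))
}
```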

createLiveStore is used when:

• SparkContext is created

Accessing AppStatusStore

AppStatusStore is available using SparkContext.
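
A hedged sketch of that access path: statusStore is package-private (private[spark]), so only code compiled in an org.apache.spark package (e.g. Spark's own web UI and tests) can call it, and numCompletedJobs is assumed to be a field of AppSummary.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.status.AppStatusStore

// Assumes an active SparkContext and code compiled in an org.apache.spark package.
val sc: SparkContext = SparkContext.getOrCreate()
val statusStore: AppStatusStore = sc.statusStore

// appSummary is described below on this page.
println(s"Completed jobs so far: ${statusStore.appSummary().numCompletedJobs}")
```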

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"status/AppStatusStore/#sparkstatustracker","title":"SparkStatusTracker

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AppStatusStore is used to create SparkStatusTracker.
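
SparkStatusTracker is the public, stable face of this data, so a quick way to see AppStatusStore at work from user code is through SparkContext.statusTracker:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("status-demo").setMaster("local[*]"))

// SparkStatusTracker answers these queries from the AppStatusStore behind the scenes.
val tracker = sc.statusTracker
println(s"Active jobs:   ${tracker.getActiveJobIds().mkString(", ")}")
println(s"Active stages: ${tracker.getActiveStageIds().mkString(", ")}")
```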

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"status/AppStatusStore/#sparkui","title":"SparkUI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AppStatusStore is used to create SparkUI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"status/AppStatusStore/#rdds","title":"RDDs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          rddList(\n  cachedOnly: Boolean = true): Seq[v1.RDDStorageInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          rddList requests the KVStore for (a view over) RDDStorageInfos (cached or not based on the given cachedOnly flag).
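
For illustration, a hypothetical helper that uses rddList to report cached RDDs; the field names follow org.apache.spark.status.api.v1.RDDStorageInfo.

```scala
import org.apache.spark.status.AppStatusStore

// Hypothetical helper: report every cached RDD with its storage footprint.
def printCachedRdds(statusStore: AppStatusStore): Unit =
  statusStore.rddList(cachedOnly = true).foreach { rdd =>
    println(
      s"RDD ${rdd.id} '${rdd.name}': " +
        s"${rdd.numCachedPartitions}/${rdd.numPartitions} partitions cached, " +
        s"${rdd.memoryUsed} bytes in memory, ${rdd.diskUsed} bytes on disk")
  }
```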

rddList is used when:

• AbstractApplicationResource is requested for the RDDs
• StageTableBase is created (and renders a stage table for AllStagesPage, JobPage and PoolPage)
• StoragePage is requested to render

Streaming Blocks

streamBlocksList(): Seq[StreamBlockData]

streamBlocksList requests the KVStore for (a view over) StreamBlockDatas.

streamBlocksList is used when:

• StoragePage is requested to render

Stages

stageList(
  statuses: JList[v1.StageStatus]): Seq[v1.StageData]

stageList requests the KVStore for (a view over) StageDatas.

stageList is used when:

• SparkStatusTracker is requested for active stage IDs
• StagesResource is requested for stages
• AllStagesPage is requested to render

Jobs

jobsList(
  statuses: JList[JobExecutionStatus]): Seq[v1.JobData]

jobsList requests the KVStore for (a view over) JobDatas.

jobsList is used when:

• SparkStatusTracker is requested for getJobIdsForGroup and getActiveJobIds
• AbstractApplicationResource is requested for jobs
• AllJobsPage is requested to render

Executors

executorList(
  activeOnly: Boolean): Seq[v1.ExecutorSummary]

executorList requests the KVStore for (a view over) ExecutorSummarys.

executorList is used when:

• FIXME

Application Summary

appSummary(): AppSummary

appSummary requests the KVStore to read the AppSummary.

appSummary is used when:

• AllJobsPage is requested to render
• AllStagesPage is requested to render

ElementTrackingStore

ElementTrackingStore is a KVStore that tracks the number of entities (elements) of specific types in a store and triggers actions once they reach a threshold.
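
The trigger mechanism can be exercised with addTrigger. A minimal sketch, assuming code in an org.apache.spark package (ElementTrackingStore is private[spark]) and a hypothetical MyEvent entity class:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.status.ElementTrackingStore
import org.apache.spark.util.kvstore.InMemoryStore

// Hypothetical entity class; a real one needs a @KVIndex-annotated natural key
// before it can be written to the store.
class MyEvent

val store = new ElementTrackingStore(new InMemoryStore, new SparkConf)

// Fire the action (asynchronously) whenever a checked write leaves more than
// 100 MyEvent elements in the store.
store.addTrigger(classOf[MyEvent], 100) { count =>
  println(s"$count MyEvent elements in the store; time to clean up")
}
```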

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"status/ElementTrackingStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ElementTrackingStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KVStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ElementTrackingStore is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusStore is requested to createLiveStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • FsHistoryProvider is requested to rebuildAppStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/ElementTrackingStore/#writing-value-to-store","title":"Writing Value to Store
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            write(\n  value: Any): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            write\u00a0is part of the KVStore abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            write requests the KVStore to write the value

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/ElementTrackingStore/#writing-value-to-store-and-checking-triggers","title":"Writing Value to Store and Checking Triggers
write(
  value: Any,
  checkTriggers: Boolean): WriteQueueResult

write writes the value to the store and, when checkTriggers is enabled, also checks the triggers registered for the value's class so that retention thresholds can be enforced (see the usage sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            write is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveEntity is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • StreamingQueryStatusListener (Spark Structured Streaming) is requested to onQueryStarted and onQueryTerminated
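
The following is a hypothetical usage sketch, not code from the Spark sources: store is an ElementTrackingStore and jobData an entity wrapper already kept in it; both names are assumptions for illustration.

store.write(jobData)                       // plain write (KVStore contract)
store.write(jobData, checkTriggers = true) // write and check the triggers registered for jobData's class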
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/ElementTrackingStore/#creating-view-of-specific-entities","title":"Creating View of Specific Entities
view[T](
  klass: Class[T]): KVStoreView[T]

view is part of the KVStore abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            view requests the KVStore for a view of klass entities.
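
A hypothetical sketch (store is an assumed ElementTrackingStore; JobDataWrapper is one of the entity classes kept in the status store):

val jobs: KVStoreView[JobDataWrapper] = store.view(classOf[JobDataWrapper])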

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/ElementTrackingStore/#registering-trigger","title":"Registering Trigger
addTrigger(
  klass: Class[_],
  threshold: Long)(
  action: Long => Unit): Unit

addTrigger registers a trigger for the given class: once the number of entities of that class in the store exceeds the given threshold, the action is invoked with the current count (see the sketch after the list below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            addTrigger is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • HiveThriftServer2Listener (Spark Thrift Server) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLAppStatusListener (Spark SQL) is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • StreamingQueryStatusListener (Spark Structured Streaming) is created
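
A hypothetical sketch of registering a trigger (store, the 1000 threshold and cleanupJobs are assumptions for illustration; JobDataWrapper is one of the entity classes kept in the status store):

// Run cleanupJobs whenever the number of JobDataWrapper entities in the store exceeds 1000.
store.addTrigger(classOf[JobDataWrapper], 1000) { count =>
  cleanupJobs(count) // invoked with the current number of tracked JobDataWrapper entities
}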
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/LiveEntity/","title":"LiveEntity","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            LiveEntity is an abstraction of entities of a running (live) Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/LiveEntity/#contract","title":"Contract","text":""},{"location":"status/LiveEntity/#doupdate","title":"doUpdate
doUpdate(): Any

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Updated view of this entity's data

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveEntity is requested to write out to the store
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"status/LiveEntity/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveExecutionData (Spark SQL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveExecutionData (Spark Thrift Server)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveExecutorStageSummary
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveJob
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveRDD
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveResourceProfile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveSessionData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveStage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LiveTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SchedulerPool
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"status/LiveEntity/#writing-out-to-store","title":"Writing Out to Store
write(
  store: ElementTrackingStore,
  now: Long,
  checkTriggers: Boolean = false): Unit

write requests this LiveEntity for the updated view (doUpdate) and writes it out to the given ElementTrackingStore, passing along the checkTriggers flag (a minimal implementation sketch follows the list below).

write is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is requested to update
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • HiveThriftServer2Listener (Spark Thrift Server) is requested to updateStoreWithTriggerEnabled and updateLiveStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SQLAppStatusListener (Spark SQL) is requested to update
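
A minimal, hypothetical LiveEntity-style implementation for illustration (LiveWordCount and WordCountSummary are made-up names): doUpdate returns the view that write(...) stores in the ElementTrackingStore.

class LiveWordCount(word: String) extends LiveEntity {
  var count: Long = 0L
  // The view of this entity that write(...) puts into the ElementTrackingStore
  override protected def doUpdate(): Any = new WordCountSummary(word, count)
}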
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/","title":"Storage System","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Storage System is a core component of Apache Spark that uses BlockManager to manage blocks in memory and on disk (based on StorageLevel).
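
For example, with the public RDD API (sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// Keep the RDD's blocks in memory and spill them to disk when they do not fit.
val lines = sc.textFile("README.md").persist(StorageLevel.MEMORY_AND_DISK)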

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockData/","title":"BlockData","text":"

BlockData is an abstraction of block data that hides how a block is actually stored and offers different ways to read the underlying data (e.g. as an InputStream, a Netty-friendly object or a ByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockDataManager/","title":"BlockDataManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockDataManager is an abstraction of block data managers that manage storage for blocks of data (aka block storage management API).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockDataManager uses BlockId to uniquely identify blocks of data and ManagedBuffer to represent them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockDataManager is used to initialize a BlockTransferService.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockDataManager is used to create a NettyBlockRpcServer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockDataManager/#contract","title":"Contract","text":""},{"location":"storage/BlockDataManager/#diagnoseshuffleblockcorruption","title":"diagnoseShuffleBlockCorruption
diagnoseShuffleBlockCorruption(
  blockId: BlockId,
  checksumByReader: Long,
  algorithm: String): Cause
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#gethostlocalshuffledata","title":"getHostLocalShuffleData
getHostLocalShuffleData(
  blockId: BlockId,
  dirs: Array[String]): ManagedBuffer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleBlockFetcherIterator is requested to fetchHostLocalBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#getlocalblockdata","title":"getLocalBlockData
getLocalBlockData(
  blockId: BlockId): ManagedBuffer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

• NettyBlockRpcServer is requested to receive a request (OpenBlocks or FetchShuffleBlocks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#getlocaldiskdirs","title":"getLocalDiskDirs
getLocalDiskDirs: Array[String]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockRpcServer is requested to handle a GetLocalDirsForExecutors request
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#putblockdata","title":"putBlockData
putBlockData(
  blockId: BlockId,
  data: ManagedBuffer,
  level: StorageLevel,
  classTag: ClassTag[_]): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Stores (puts) a block data (as a ManagedBuffer) for the given BlockId. Returns true when completed successfully or false when failed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockRpcServer is requested to receive a UploadBlock request
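The following is a minimal, illustrative sketch of a putBlockData-style contract, not Spark's BlockManager: an in-memory store keyed by block name in which the Boolean result signals whether the block was stored. All names (ToyBlockId, ToyBlockDataManager) are hypothetical.

```scala
import java.nio.ByteBuffer
import scala.collection.concurrent.TrieMap

// Hypothetical stand-ins for BlockId and ManagedBuffer; not Spark classes
final case class ToyBlockId(name: String)

class ToyBlockDataManager {
  private val blocks = TrieMap.empty[String, ByteBuffer]

  // Mirrors the Boolean contract: true when the block was stored, false otherwise
  def putBlockData(blockId: ToyBlockId, data: ByteBuffer): Boolean =
    blocks.putIfAbsent(blockId.name, data.duplicate()).isEmpty

  // Mirrors getLocalDiskDirs; this toy keeps everything in memory, so there are none
  def getLocalDiskDirs: Array[String] = Array.empty
}

object ToyBlockDataManagerDemo extends App {
  val manager = new ToyBlockDataManager
  val stored = manager.putBlockData(
    ToyBlockId("broadcast_0"),
    ByteBuffer.wrap("payload".getBytes("UTF-8")))
  println(s"stored = $stored") // stored = true
}
```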
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#putblockdataasstream","title":"putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putBlockDataAsStream(\n  blockId: BlockId,\n  level: StorageLevel,\n  classTag: ClassTag[_]): StreamCallbackWithID\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockRpcServer is requested to receiveStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#releaselock","title":"releaseLock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            releaseLock(\n  blockId: BlockId,\n  taskContext: Option[TaskContext]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TorrentBroadcast is requested to releaseBlockManagerLock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to handleLocalReadFailure, getLocalValues, getOrElseUpdate, doPut, releaseLockAndDispose
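releaseLock pairs with an earlier lock acquisition on the block (Spark tracks these locks in BlockInfoManager, which is not shown here). As an illustration only, the toy registry below models the acquire/release discipline with one read-write lock per block name; all names are hypothetical.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.concurrent.TrieMap

// A toy per-block lock registry; not Spark's API
object ToyBlockLocks {
  private val locks = TrieMap.empty[String, ReentrantReadWriteLock]

  private def lockFor(blockId: String): ReentrantReadWriteLock =
    locks.getOrElseUpdate(blockId, new ReentrantReadWriteLock)

  // Acquire before reading the block's data
  def lockForReading(blockId: String): Unit = lockFor(blockId).readLock().lock()

  // Mirrors releaseLock: callers release once they are done with the block's data
  def releaseLock(blockId: String): Unit = locks.get(blockId).foreach(_.readLock().unlock())
}

object ToyBlockLocksDemo extends App {
  ToyBlockLocks.lockForReading("rdd_0_0")
  try {
    // ... read the block's data here ...
  } finally {
    ToyBlockLocks.releaseLock("rdd_0_0")
  }
}
```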
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockDataManager/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockEvictionHandler/","title":"BlockEvictionHandler","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockEvictionHandler is an abstraction of block eviction handlers that can drop blocks from memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockEvictionHandler/#contract","title":"Contract","text":""},{"location":"storage/BlockEvictionHandler/#dropping-block-from-memory","title":"Dropping Block from Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            dropFromMemory[T: ClassTag](\n  blockId: BlockId,\n  data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested to evict blocks
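To make the contract concrete, here is a toy eviction handler (illustration only, not Spark's BlockManager): when a block is forced out of memory it is either spilled to a mock disk map or dropped, and the method reports where the block lives afterwards, mirroring the StorageLevel result and the by-name data thunk of dropFromMemory. All names are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical storage levels for the sketch; Spark's StorageLevel is richer
sealed trait ToyStorageLevel
case object MemoryOnly extends ToyStorageLevel
case object DiskOnly extends ToyStorageLevel
case object NotStored extends ToyStorageLevel

class ToyEvictionHandler(useDisk: Boolean) {
  private val disk = mutable.Map.empty[String, Array[Byte]]

  // `data` is a thunk, like the `data: () => Either[...]` parameter of dropFromMemory:
  // the bytes are only materialized if the block is actually spilled to disk
  def dropFromMemory(blockId: String, data: () => Array[Byte]): ToyStorageLevel =
    if (useDisk) {
      disk(blockId) = data()
      DiskOnly
    } else {
      NotStored
    }
}

object ToyEvictionHandlerDemo extends App {
  val handler = new ToyEvictionHandler(useDisk = true)
  println(handler.dropFromMemory("rdd_0_0", () => Array[Byte](1, 2, 3))) // DiskOnly
}
```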
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockEvictionHandler/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockId/","title":"BlockId","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockId is an abstraction of data block identifiers based on an unique name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockId/#contract","title":"Contract","text":""},{"location":"storage/BlockId/#name","title":"Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            name: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A globally unique identifier of this Block

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to putBlockDataAsStream and readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UpdateBlockInfo is requested to writeExternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DiskBlockManager is requested to getFile and containsBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DiskStore is requested to getBytes, remove, moveFileToBlock, contains
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockId/#implementations","title":"Implementations","text":"Sealed Abstract Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockId is a Scala sealed abstract class which means that all of the implementations are in the same compilation unit (a single file).
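As a quick illustration of the name contract and the sealed hierarchy (e.g. in spark-shell), the snippet below builds a few concrete BlockIds and prints their names. The documented prefixes come from the sections that follow; the exact separator format and the name-parsing apply of the companion object are assumptions based on Spark 3.x.

```scala
import org.apache.spark.storage.{BlockId, BroadcastBlockId, RDDBlockId, ShuffleBlockId}

// Concrete BlockIds and their globally-unique names (formats assume Spark 3.x)
RDDBlockId(rddId = 0, splitIndex = 1).name                    // rdd_0_1
BroadcastBlockId(broadcastId = 2).name                        // broadcast_2
ShuffleBlockId(shuffleId = 3, mapId = 4L, reduceId = 5).name  // shuffle_3_4_5

// The companion object parses a name back into the matching implementation
BlockId("rdd_0_1")                                            // RDDBlockId(0,1)
```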

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockId/#broadcastblockid","title":"BroadcastBlockId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockId for broadcast variable blocks:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • broadcastId identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Optional field name (default: empty)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Uses broadcast_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • TorrentBroadcast is created, requested to store a broadcast and the blocks in a local BlockManager, and read blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to remove all the blocks of a broadcast variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SerializerManager is requested to shouldCompress
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is requested to onBlockUpdated
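For example (a small illustration; the "piece" labelling is how TorrentBroadcast names the chunks it stores, assuming Spark 3.x), the optional field shows up in the names of individual broadcast pieces:

```scala
import org.apache.spark.storage.BroadcastBlockId

BroadcastBlockId(0).name            // broadcast_0        (the whole broadcast value)
BroadcastBlockId(0, "piece0").name  // broadcast_0_piece0 (one torrent piece of it)
```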
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockId/#rddblockid","title":"RDDBlockId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockId for RDD partitions:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • rddId identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • splitIndex identifier

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Uses rdd_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • StorageStatus is requested to register the status of a data block, get the status of a data block, updateStorageInfo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • LocalRDDCheckpointData is requested to doCheckpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RDD is requested to getOrCompute
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DAGScheduler is requested for the BlockManagers (executors) for cached RDD partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManagerMasterEndpoint is requested to removeRdd
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusListener is requested to updateRDDBlock (when onBlockUpdated for an RDDBlockId)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Compressed when spark.rdd.compress configuration property is enabled
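A short sketch of where rdd_ block names come from and how the compression property is set (illustration only, assuming a local Spark 3.x run submitted with spark-submit; spark.rdd.compress applies to serialized storage levels such as MEMORY_ONLY_SER):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.{RDDBlockId, StorageLevel}

object RddBlockIdSketch extends App {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("rdd-block-id-sketch")
    .set("spark.rdd.compress", "true")  // compress serialized cached partitions
  val sc = new SparkContext(conf)

  // Each cached partition is stored under RDDBlockId(rdd.id, partitionIndex)
  val rdd = sc.parallelize(1 to 100, numSlices = 2).persist(StorageLevel.MEMORY_ONLY_SER)
  rdd.count()  // materializes the cached blocks

  println(RDDBlockId(rdd.id, splitIndex = 0).name)  // rdd_<id>_0
  println(RDDBlockId(rdd.id, splitIndex = 1).name)  // rdd_<id>_1

  sc.stop()
}
```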

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockId/#shuffleblockbatchid","title":"ShuffleBlockBatchId","text":""},{"location":"storage/BlockId/#shuffleblockid","title":"ShuffleBlockId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockId for shuffle blocks:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • shuffleId identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • mapId identifier
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • reduceId identifier

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Uses shuffle_ prefix for the name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleBlockFetcherIterator is requested to throwFetchFailedException
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapOutputTracker utility is requested to convertMapStatuses
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockRpcServer is requested to handle a FetchShuffleBlocks message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ExternalSorter is requested to writePartitionedMapOutput
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleBlockFetcherIterator is requested to mergeContinuousShuffleBlockIdsIfNeeded
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • IndexShuffleBlockResolver is requested to getBlockData

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Compressed when spark.shuffle.compress configuration property is enabled
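As an illustration (the exact separator format is an assumption based on Spark 3.x, where mapId is the map task id), a reducer fetching the output of map task 12 for shuffle 0 and reduce partition 3 addresses it as:

```scala
import org.apache.spark.storage.ShuffleBlockId

ShuffleBlockId(shuffleId = 0, mapId = 12L, reduceId = 3).name  // shuffle_0_12_3
```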

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockId/#shuffledatablockid","title":"ShuffleDataBlockId","text":""},{"location":"storage/BlockId/#shuffleindexblockid","title":"ShuffleIndexBlockId","text":""},{"location":"storage/BlockId/#streamblockid","title":"StreamBlockId

StreamBlockId takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • streamId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • uniqueId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Uses the following name:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            input-[streamId]-[uniqueId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Used in Spark Streaming
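
For example, a minimal sketch assuming the org.apache.spark.storage.StreamBlockId case class with the streamId and uniqueId listed above:

import org.apache.spark.storage.StreamBlockId

// streamId = 0 and uniqueId = 100 produce the name "input-0-100"
val streamBlock = StreamBlockId(streamId = 0, uniqueId = 100)
assert(streamBlock.name == "input-0-100")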

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockId/#taskresultblockid","title":"TaskResultBlockId","text":""},{"location":"storage/BlockId/#templocalblockid","title":"TempLocalBlockId","text":""},{"location":"storage/BlockId/#tempshuffleblockid","title":"TempShuffleBlockId","text":""},{"location":"storage/BlockId/#testblockid","title":"TestBlockId","text":""},{"location":"storage/BlockId/#creating-blockid-by-name","title":"Creating BlockId by Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            apply(\n  name: String): BlockId\n

apply creates one of the available BlockIds for the given name (which uses a prefix to differentiate between the different types of BlockIds).
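
For example, a minimal sketch of resolving a BlockId from a name (assuming the org.apache.spark.storage.BlockId companion object and the rdd_ and shuffle_ name prefixes):

import org.apache.spark.storage.{BlockId, RDDBlockId, ShuffleBlockId}

// The prefix of the name selects the concrete BlockId type
val rddBlock = BlockId("rdd_1_2")            // RDDBlockId(rddId = 1, splitIndex = 2)
val shuffleBlock = BlockId("shuffle_0_1_2")  // ShuffleBlockId(shuffleId = 0, mapId = 1, reduceId = 2)
assert(rddBlock == RDDBlockId(1, 2))
assert(shuffleBlock == ShuffleBlockId(0, 1, 2))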

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            apply is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • NettyBlockRpcServer is requested to handle OpenBlocks, UploadBlock messages and receiveStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • UpdateBlockInfo is requested to deserialize (readExternal)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • DiskBlockManager is requested for all the blocks (from files stored on disk)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ShuffleBlockFetcherIterator is requested to sendRequest
• JsonProtocol utility is used for accumValueFromJson, taskMetricsFromJson and blockUpdatedInfoFromJson
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/BlockInfo/","title":"BlockInfo","text":"

BlockInfo is the metadata of a data block (stored in MemoryStore or DiskStore).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockInfo/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockInfo takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • tellMaster flag

BlockInfo is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is requested to doPut
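
For illustration, a sketch of the metadata shape using a hypothetical BlockMetadataSketch class (BlockInfo itself is internal to the storage package, so this stand-in only mirrors the constructor arguments listed above):

import org.apache.spark.storage.StorageLevel
import scala.reflect.ClassTag

// Hypothetical stand-in for the internal BlockInfo class
class BlockMetadataSketch(
    val level: StorageLevel,    // how the block is stored (memory and/or disk, replication)
    val classTag: ClassTag[_],  // type of the values stored in the block
    val tellMaster: Boolean)    // whether block status updates are reported to the driver

val info = new BlockMetadataSketch(StorageLevel.MEMORY_ONLY, ClassTag.Any, tellMaster = true)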
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockInfo/#block-size","title":"Block Size

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockInfo knows the size of the block (in bytes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The size is 0 by default and changes when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is requested to doPutIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockInfo/#reader-count","title":"Reader Count

readerCount is the number of times that this block has been locked for reading.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readerCount is 0 by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readerCount changes back to 0 when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockInfoManager is requested to remove a block and clear

readerCount is incremented when a read lock is acquired and decremented when the following happens:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockInfoManager is requested to release a lock and releaseAllLocksForTask
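
For illustration, a simplified model of how the counter evolves (a sketch, not the actual BlockInfo code):

// Illustrative only: models the readerCount lifecycle described above
class ReadLockCounter {
  private var readerCount = 0                    // 0 by default

  def lockForReading(): Unit = readerCount += 1  // incremented when a read lock is acquired
  def unlock(): Unit = readerCount -= 1          // decremented when a read lock is released
  def clear(): Unit = readerCount = 0            // removeBlock / clear reset the counter to 0
}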
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockInfo/#writer-task","title":"Writer Task

writerTask attribute is the ID of the task that owns the write lock for the block, or one of the following special values:

• -1 for no writer and hence no write lock in use
• -1024 for non-task threads (e.g. a driver thread or unit test code)

writerTask is assigned (a task ID or one of the special values above) when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockInfoManager is requested to lockForWriting, unlock, releaseAllLocksForTask, removeBlock, clear
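
As an aside, a minimal sketch of the two sentinel values described above (the constant names are assumed for illustration; only the values -1 and -1024 come from the text):

// Sketch only: named constants for the writerTask sentinel values
val NO_WRITER: Long = -1          // no write lock in use
val NON_TASK_WRITER: Long = -1024 // write lock held by a non-task thread (driver or test code)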
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/","title":"BlockInfoManager","text":"

BlockInfoManager is used by BlockManager (and MemoryStore) to manage metadata of memory blocks and to control concurrent access to them with read and write locks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockInfoManager is used to create a MemoryStore and a BlockManagerManagedBuffer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockInfoManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockInfoManager takes no arguments to be created.

BlockInfoManager is created for BlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockInfoManager/#block-metadata","title":"Block Metadata
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                infos: HashMap[BlockId, BlockInfo]\n

BlockInfoManager uses the infos registry to track the metadata (BlockInfo) of every block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#locks","title":"Locks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Locks are the mechanism to control concurrent access to data and prevent destructive interaction between operations that use the same resource.

BlockInfoManager uses read and write locks, tracked per task attempt.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#read-locks","title":"Read Locks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                readLocksByTask: HashMap[TaskAttemptId, ConcurrentHashMultiset[BlockId]]\n

BlockInfoManager uses the readLocksByTask registry to track tasks (by TaskAttemptId) and the blocks they have locked for reading (as BlockIds).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A new entry is added when BlockInfoManager is requested to register a task (attempt).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A new BlockId is added to an existing task attempt in lockForReading.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#write-locks","title":"Write Locks

BlockInfoManager tracks tasks (by TaskAttemptId) and the blocks they locked for writing (as BlockIds).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#registering-task-execution-attempt","title":"Registering Task (Execution Attempt)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerTask(\n  taskAttemptId: Long): Unit\n

registerTask registers a new \"empty\" entry for the given task (by the task attempt ID) in the readLocksByTask internal registry.
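
A minimal sketch of what the registration amounts to, assuming the readLocksByTask registry described above and Guava's ConcurrentHashMultiset (illustrative, not the actual implementation):

import com.google.common.collect.ConcurrentHashMultiset
import org.apache.spark.storage.BlockId
import scala.collection.mutable

object ReadLockRegistrySketch {
  // taskAttemptId -> blocks the task attempt has locked for reading
  private val readLocksByTask = mutable.HashMap.empty[Long, ConcurrentHashMultiset[BlockId]]

  // Registering a task attempt creates a new, empty entry for it
  def registerTask(taskAttemptId: Long): Unit = synchronized {
    require(!readLocksByTask.contains(taskAttemptId))
    readLocksByTask(taskAttemptId) = ConcurrentHashMultiset.create[BlockId]()
  }
}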

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                registerTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockInfoManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to registerTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#downgrading-exclusive-write-lock-to-shared-read-lock","title":"Downgrading Exclusive Write Lock to Shared Read Lock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                downgradeLock(\n  blockId: BlockId): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                downgradeLock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] downgrading write lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                downgradeLock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                downgradeLock is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to doPut and downgradeLock
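
The callers above, together with the lockForReading and unlock sections below, suggest what the FIXME'd part boils down to: inside one synchronized section, release the exclusive write lock and immediately take a shared read lock, which cannot fail because no other writer can sneak in between. A hedged sketch of that downgrade pattern on a toy lock state (not Spark's actual code):

```scala
// Toy monitor-based lock state demonstrating the write -> read downgrade pattern.
// Names mirror the documentation, but this is a sketch, not BlockInfoManager itself.
class ToyBlockLock {
  private var writerTask: Option[Long] = None // task attempt id holding the write lock
  private var readerCount: Int = 0

  def lockForWriting(taskAttemptId: Long): Unit = synchronized {
    while (writerTask.isDefined || readerCount > 0) wait()
    writerTask = Some(taskAttemptId)
  }

  def downgradeLock(taskAttemptId: Long): Unit = synchronized {
    require(writerTask.contains(taskAttemptId), s"task $taskAttemptId holds no write lock")
    writerTask = None   // release the exclusive write lock ...
    readerCount += 1    // ... and immediately take a shared read lock
    notifyAll()         // wake up readers that were waiting for the writer to go away
  }

  def unlockRead(): Unit = synchronized {
    readerCount -= 1
    notifyAll()
  }
}

object ToyBlockLockDemo extends App {
  val lock = new ToyBlockLock
  lock.lockForWriting(1L)
  lock.downgradeLock(1L) // other readers may now proceed; writers still have to wait
  lock.unlockRead()
}
```
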
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#obtaining-read-lock-for-block","title":"Obtaining Read Lock for Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForReading(\n  blockId: BlockId,\n  blocking: Boolean = true): Option[BlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForReading locks a given memory block for reading when the block was registered earlier and no writer tasks use it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When executed, lockForReading prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] trying to acquire read lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForReading looks up the metadata of the blockId block (in the infos registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If no metadata could be found, lockForReading returns None which means that the block does not exist or was removed (and anybody could acquire a write lock).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, when the metadata was found (i.e. registered) lockForReading checks so-called writerTask. Only when the block has no writer tasks, a read lock can be acquired. If so, the readerCount of the block metadata is incremented and the block is recorded (in the internal readLocksByTask registry). lockForReading prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] acquired read lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The BlockInfo for the blockId block is returned.

Note

-1024 is a special taskAttemptId (NON_TASK_WRITER) used to mark a non-task thread, e.g. a driver thread or unit test code.

For blocks whose writerTask is other than NO_WRITER, when blocking is enabled, lockForReading waits (until another thread invokes the Object.notify or Object.notifyAll method on this object).

With blocking enabled, lockForReading repeats the waiting-for-read-lock sequence until either None is returned or the lock is obtained.

When blocking is disabled and the lock could not be obtained, None is returned immediately.

Note

lockForReading is a synchronized method, i.e. no two threads can execute it (or any other synchronized method of the same BlockInfoManager instance) concurrently.

lockForReading is used when:

• BlockInfoManager is requested to downgradeLock and lockNewBlockForWriting
• BlockManager is requested to getLocalValues, getLocalBytes and replicateBlock
• BlockManagerManagedBuffer is requested to retain
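
The waiting-for-read-lock sequence described above is the classic synchronized wait loop: check the block's state and, if a writer holds it, either give up (non-blocking) or wait() and re-check once the monitor is notified. A self-contained sketch of that loop, assuming a simplified Info(writerTask, readerCount) shape (illustrative only, not Spark's code):

```scala
import scala.collection.mutable

// Sketch of the blocking read-lock loop; names mirror the docs, but this is not Spark code.
class ToyReadLocks {
  private case class Info(var writerTask: Long, var readerCount: Int)
  private val NoWriter = -1L
  private val infos = mutable.HashMap.empty[String, Info]

  def register(blockId: String): Unit = synchronized {
    infos(blockId) = Info(NoWriter, 0)
  }

  /** Returns the reader count after acquiring the read lock, or None when the block
    * is unknown (or a writer holds it and blocking is disabled). */
  def lockForReading(blockId: String, blocking: Boolean = true): Option[Int] =
    synchronized {
      while (true) {
        infos.get(blockId) match {
          case None =>
            return None                 // block not registered (or already removed)
          case Some(info) if info.writerTask == NoWriter =>
            info.readerCount += 1       // no writer: acquire the shared read lock
            return Some(info.readerCount)
          case Some(_) if !blocking =>
            return None                 // would have to wait; give up immediately
          case Some(_) =>
            wait()                      // a writer holds the block: wait, then retry
        }
      }
      None // unreachable: the loop exits only via return
    }
}
```
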
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#obtaining-write-lock-for-block","title":"Obtaining Write Lock for Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForWriting(\n  blockId: BlockId,\n  blocking: Boolean = true): Option[BlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] trying to acquire write lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForWriting finds the blockId (in the infos registry). When no BlockInfo could be found, None is returned. Otherwise, blockId block is checked for writerTask to be BlockInfo.NO_WRITER with no readers (i.e. readerCount is 0) and only then the lock is returned.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When the write lock can be returned, BlockInfo.writerTask is set to currentTaskAttemptId and a new binding is added to the internal writeLocksByTask registry. lockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] acquired write lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If, for some reason, BlockInfo.md#writerTask[blockId has a writer] or the number of readers is positive (i.e. BlockInfo.readerCount is greater than 0), the method will wait (based on the input blocking flag) and attempt the write lock acquisition process until it finishes with a write lock.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                NOTE: (deadlock possible) The method is synchronized and can block, i.e. wait that causes the current thread to wait until another thread invokes Object.notify or Object.notifyAll methods for this object.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockForWriting returns None for no blockId in the internal infos registry or when blocking flag is disabled and the write lock could not be acquired.

lockForWriting is used when:

• BlockInfoManager is requested to lockNewBlockForWriting
• BlockManager is requested to removeBlock
• MemoryStore is requested to evictBlocksToFreeSpace
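
Compared to the read path, the write path's acquisition condition is stricter: there must be no current writer and no readers before the current task attempt id is recorded as the writer. A condensed sketch of just that check (illustrative, not Spark's code):

```scala
// Condensed write-lock check on the same toy Info(writerTask, readerCount) shape
// used in the read-lock sketch above. Illustrative only.
object WriteLockSketch {
  final case class Info(var writerTask: Long, var readerCount: Int)
  val NoWriter = -1L

  // A write lock is granted only when nobody writes AND nobody reads.
  def tryAcquireWrite(info: Info, taskAttemptId: Long): Boolean =
    if (info.writerTask == NoWriter && info.readerCount == 0) {
      info.writerTask = taskAttemptId // record the new exclusive owner
      true
    } else {
      false // the caller either waits (blocking) or gives up (non-blocking)
    }

  def main(args: Array[String]): Unit = {
    val info = Info(NoWriter, 0)
    println(tryAcquireWrite(info, taskAttemptId = 7L)) // true
    println(tryAcquireWrite(info, taskAttemptId = 8L)) // false: task 7 holds the write lock
  }
}
```
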
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#obtaining-write-lock-for-new-block","title":"Obtaining Write Lock for New Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockNewBlockForWriting(\n  blockId: BlockId,\n  newBlockInfo: BlockInfo): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockNewBlockForWriting obtains a write lock for blockId but only when the method could register the block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockNewBlockForWriting is similar to lockForWriting method but for brand new blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When executed, lockNewBlockForWriting prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] trying to put [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If some other thread has already created the block, lockNewBlockForWriting finishes returning false. Otherwise, when the block does not exist, newBlockInfo is recorded in the infos internal registry and the block is locked for this client for writing. lockNewBlockForWriting then returns true.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lockNewBlockForWriting executes itself in synchronized block so once the BlockInfoManager is locked the other internal registries should be available for the current thread only.

lockNewBlockForWriting is used when:

• BlockManager is requested to doPut
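
In other words, lockNewBlockForWriting is an atomic "register if absent, then write-lock" step: the existence check and the registration happen under the same lock, so two racing tasks cannot both believe they created the block. A hedged sketch of that shape (illustrative only; per the lockForReading usage list above, the real method also touches the read-lock path for the losing task):

```scala
import scala.collection.mutable

// Sketch of "register the block if absent, then write-lock it". Illustrative only.
object LockNewBlockSketch {
  final case class Info(var writerTask: Long)
  val NoWriter = -1L

  private val infos = mutable.HashMap.empty[String, Info]

  def lockNewBlockForWriting(blockId: String, newBlockInfo: Info, taskAttemptId: Long): Boolean =
    synchronized {
      if (infos.contains(blockId)) {
        false // some other task already created the block
      } else {
        infos(blockId) = newBlockInfo           // register the brand new block ...
        newBlockInfo.writerTask = taskAttemptId // ... and write-lock it for this task
        true
      }
    }

  def main(args: Array[String]): Unit = {
    println(lockNewBlockForWriting("rdd_0_0", Info(NoWriter), taskAttemptId = 1L)) // true
    println(lockNewBlockForWriting("rdd_0_0", Info(NoWriter), taskAttemptId = 2L)) // false
  }
}
```
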
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#releasing-lock-on-block","title":"Releasing Lock on Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                unlock(\n  blockId: BlockId,\n  taskAttemptId: Option[TaskAttemptId] = None): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                unlock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Task [currentTaskAttemptId] releasing lock for [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                unlock gets the metadata for blockId (and throws an IllegalStateException if the block was not found).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                If the writer task for the block is not NO_WRITER, it becomes so and the blockId block is removed from the internal writeLocksByTask registry for the current task attempt.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, if the writer task is indeed NO_WRITER, the block is assumed locked for reading. The readerCount counter is decremented for the blockId block and the read lock removed from the internal readLocksByTask registry for the task attempt.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In the end, unlock wakes up all the threads waiting for the BlockInfoManager.

unlock is used when:

• BlockInfoManager is requested to downgradeLock
• BlockManager is requested to releaseLock and doPut
• BlockManagerManagedBuffer is requested to release
• MemoryStore is requested to evictBlocksToFreeSpace
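
The two branches above (clear the writer, or decrement the reader count) plus the final wake-up are easy to see in a toy form. A sketch of that release logic (illustrative, not Spark's code):

```scala
import scala.collection.mutable

// Sketch of the two unlock branches followed by notifyAll; illustrative only.
class ToyUnlock {
  final case class Info(var writerTask: Long, var readerCount: Int)
  val NoWriter = -1L
  private val infos = mutable.HashMap.empty[String, Info]

  def register(blockId: String, info: Info): Unit = synchronized {
    infos(blockId) = info
  }

  def unlock(blockId: String): Unit = synchronized {
    val info = infos.getOrElse(
      blockId, throw new IllegalStateException(s"Block $blockId not found"))
    if (info.writerTask != NoWriter) {
      info.writerTask = NoWriter // branch 1: release the exclusive write lock
    } else {
      info.readerCount -= 1      // branch 2: release one shared read lock
    }
    notifyAll()                  // wake up every thread waiting on this monitor
  }
}
```
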
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockInfoManager/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.storage.BlockInfoManager logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.storage.BlockInfoManager=ALL

Refer to Logging.
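
For interactive sessions (e.g. spark-shell), the same effect can be achieved programmatically. A minimal sketch, assuming the classic log4j 1.x API implied by conf/log4j.properties (newer Spark versions use Log4j 2 and conf/log4j2.properties instead):

// Hedged alternative to editing conf/log4j.properties: raise the logger's level at runtime.
// Assumes the log4j 1.x API; adjust accordingly for Log4j 2-based Spark versions.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.storage.BlockInfoManager").setLevel(Level.ALL)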

BlockManager

BlockManager manages the storage for blocks (chunks of data) that can be stored in memory and on disk.

BlockManager runs as part of the driver and executor processes.

BlockManager provides an interface for uploading and fetching blocks both locally and remotely using various stores (i.e. memory, disk, and off-heap).

Cached blocks are blocks with a non-zero sum of memory and disk sizes.

Tip

Use the web UI (esp. the Storage and Executors tabs) to monitor the memory used.

Tip

Use spark-submit's command-line options (i.e. --driver-memory for the driver and --executor-memory for executors) or their equivalents as Spark properties (i.e. spark.driver.memory and spark.executor.memory) to control the memory available for storage.
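
As an illustration (an assumption, not part of the original text), the equivalent Spark properties can also be set on a SparkConf, keeping in mind that spark.driver.memory only takes effect if it is set before the driver JVM starts (hence the spark-submit option is the usual route):

// Hedged sketch: setting the memory-related properties programmatically.
// spark.driver.memory is only honoured before the driver JVM starts,
// so in practice it is typically passed to spark-submit instead.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "2g")
  .set("spark.executor.memory", "4g")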

When External Shuffle Service is enabled, BlockManager uses ExternalShuffleClient to read shuffle files (of other executors).

Creating Instance

BlockManager takes the following to be created:

• Executor ID
• RpcEnv
• BlockManagerMaster
• SerializerManager
• SparkConf
• MemoryManager
• MapOutputTracker
• ShuffleManager
• BlockTransferService
• SecurityManager
• Optional ExternalBlockStoreClient

When created, BlockManager sets the externalShuffleServiceEnabled internal flag based on the spark.shuffle.service.enabled configuration property.

BlockManager then creates an instance of DiskBlockManager (requesting deleteFilesOnStop when an external shuffle service is not in use).

BlockManager creates a block-manager-future daemon cached thread pool with at most 128 threads (as futureExecutionContext).

BlockManager calculates the maximum memory to use (as maxMemory) by requesting the maximum on-heap and off-heap storage memory from the assigned MemoryManager.

BlockManager calculates the port used by the external shuffle service (as externalShuffleServicePort).

BlockManager creates a client to read other executors' shuffle files (as shuffleClient). If the external shuffle service is used...FIXME

BlockManager sets the maximum number of failures before this block manager refreshes the block locations from the driver (as maxFailuresBeforeLocationRefresh).

BlockManager registers a BlockManagerSlaveEndpoint with the input RpcEnv, itself, and the MapOutputTracker (as slaveEndpoint).

BlockManager is created when SparkEnv is created (for the driver and executors) when a Spark application starts.
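
Once a Spark application is up, the BlockManager of the current process can be reached through SparkEnv (the same access path used for diskBlockManager later on this page). A minimal sketch:

// Minimal sketch: the BlockManager of the current driver or executor process
// is available through its SparkEnv.
import org.apache.spark.SparkEnv

val blockManager = SparkEnv.get.blockManager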

MemoryManager

BlockManager is given a MemoryManager when created.

BlockManager uses the MemoryManager for the following:

• Create a MemoryStore (that is then assigned to the MemoryManager as a "circular dependency"; see the sketch below)

• Initialize maxOnHeapMemory and maxOffHeapMemory (for reporting)
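
The "circular dependency" can be pictured with a simplified, hypothetical sketch (not Spark's actual classes): the store needs the manager to account for memory, while the manager needs the store to evict blocks, so one of the two references is injected after construction.

// Hypothetical sketch of the wiring described above (names are illustrative only).
class ManagerSketch {
  private var store: StoreSketch = _                    // filled in after construction
  def setMemoryStore(s: StoreSketch): Unit = store = s
}

class StoreSketch(manager: ManagerSketch)                // needs the manager up front

val manager = new ManagerSketch
val store   = new StoreSketch(manager)
manager.setMemoryStore(store)                            // closes the cycle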

DiskBlockManager

BlockManager creates a DiskBlockManager when created.

BlockManager uses the DiskBlockManager for the following:

• Creating a DiskStore
• Registering an executor with a local external shuffle service (when initialized on an executor with externalShuffleServiceEnabled)

The DiskBlockManager is available to other Spark subsystems as the diskBlockManager reference.

import org.apache.spark.SparkEnv
SparkEnv.get.blockManager.diskBlockManager

MigratableResolver

migratableResolver: MigratableResolver

BlockManager creates a reference to a MigratableResolver by requesting the ShuffleManager for the ShuffleBlockResolver (that is assumed to be a MigratableResolver).

Lazy Value

migratableResolver is a Scala lazy value to guarantee that the code to initialize it is executed only once (when accessed for the first time) and the computed value never changes afterwards.

private[storage]

migratableResolver is private[storage] so it is available to other classes in the org.apache.spark.storage package.
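
The two notes above boil down to standard Scala lazy val semantics: the initializer runs once, on the first access, and the result is cached. A hypothetical, self-contained sketch (not the Spark source):

// Hypothetical sketch of the lazy-value behaviour described above.
object LazyValueDemo extends App {
  lazy val resolver: String = {
    println("resolving once")   // printed only on the first access
    "MigratableResolver"
  }

  println(resolver)   // triggers the one-time initialization
  println(resolver)   // reuses the cached value; nothing is re-initialized
}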

migratableResolver is used when:

• BlockManager is requested to putBlockDataAsStream
• ShuffleMigrationRunnable is requested to run
• BlockManagerDecommissioner is requested to refreshOffloadingShuffleBlocks
• FallbackStorage is requested to copy

Local Directories for Block Storage

getLocalDiskDirs: Array[String]

getLocalDiskDirs requests the DiskBlockManager for the local directories for block storage.

getLocalDiskDirs is part of the BlockDataManager abstraction.

getLocalDiskDirs is also used by BlockManager when requested for the following:

• Register with a local external shuffle service
• Initialize
• Re-register

Initializing BlockManager

initialize(
  appId: String): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize requests the BlockTransferService to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize requests the ExternalBlockStoreClient to initialize (if given).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize determines the BlockReplicationPolicy based on spark.storage.replication.policy configuration property and prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Using [priorityClass] for block replication policy\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize creates a BlockManagerId and requests the BlockManagerMaster to registerBlockManager (with the BlockManagerId, the local directories of the DiskBlockManager, the maxOnHeapMemory, the maxOffHeapMemory and the slaveEndpoint).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize sets the internal BlockManagerId to be the response from the BlockManagerMaster (if available) or the BlockManagerId just created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize initializes the External Shuffle Server's Address when enabled and prints out the following INFO message to the logs (with the externalShuffleServicePort):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  external shuffle service port = [externalShuffleServicePort]\n

(only for executors with the External Shuffle Service enabled) initialize registers with the External Shuffle Server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize determines the hostLocalDirManager. With spark.shuffle.readHostLocalDisk configuration property enabled and spark.shuffle.useOldFetchProtocol disabled, initialize uses the ExternalBlockStoreClient to create a HostLocalDirManager (with spark.storage.localDiskByExecutors.cacheSize configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, initialize prints out the following INFO message to the logs (with the blockManagerId):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Initialized BlockManager: [blockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  initialize is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkContext is created (on the driver)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Executor is created (with isLocal flag disabled)
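The behaviour above is steered by a handful of configuration properties. A minimal sketch of a SparkConf that sets them explicitly (the values shown are defaults or purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: all of these properties have sensible defaults.
val conf = new SparkConf()
  .setAppName("blockmanager-initialize-demo")
  .setMaster("local[*]")
  // Block replication policy resolved by initialize
  .set("spark.storage.replication.policy",
    "org.apache.spark.storage.RandomBlockReplicationPolicy")
  // Whether executors register with an external shuffle service
  .set("spark.shuffle.service.enabled", "false")
  // Host-local disk reads require the new fetch protocol
  .set("spark.shuffle.readHostLocalDisk", "true")
  .set("spark.shuffle.useOldFetchProtocol", "false")

// Creating a SparkContext creates and initializes the driver's BlockManager.
val sc = new SparkContext(conf)
```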
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#registering-executors-blockmanager-with-external-shuffle-server","title":"Registering Executor's BlockManager with External Shuffle Server
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  registerWithExternalShuffleServer(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  registerWithExternalShuffleServer registers the BlockManager (for an executor) with External Shuffle Service.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  registerWithExternalShuffleServer prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Registering executor with local external shuffle service.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  registerWithExternalShuffleServer creates an ExecutorShuffleInfo (with the localDirs and subDirsPerLocalDir of the DiskBlockManager, and the class name of the ShuffleManager).

registerWithExternalShuffleServer uses the spark.shuffle.registration.maxAttempts configuration property and a 5-second sleep time when requesting the ExternalBlockStoreClient to registerWithShuffleServer (using the BlockManagerId and the ExecutorShuffleInfo).

In case of an exception while there are still attempts left, registerWithExternalShuffleServer prints out the following ERROR message to the logs and sleeps for 5 seconds:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Failed to connect to external shuffle server, will retry [attempts] more times after waiting 5 seconds...\n
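The retry behaviour described above can be sketched as follows (a simplification, not the actual Spark code; the value of spark.shuffle.registration.maxAttempts is assumed):

```scala
// Simplified sketch of the registration retry loop.
val maxAttempts = 3        // assumed value of spark.shuffle.registration.maxAttempts
val sleepTimeSecs = 5      // fixed 5-second pause between attempts

def registerWithRetries(register: () => Unit): Unit = {
  for (attempt <- 1 to maxAttempts) {
    try {
      register()
      return
    } catch {
      case _: Exception if attempt < maxAttempts =>
        println("Failed to connect to external shuffle server, will retry " +
          s"${maxAttempts - attempt} more times after waiting $sleepTimeSecs seconds...")
        Thread.sleep(sleepTimeSecs * 1000L)
    }
  }
}
```

On the last attempt the exception is no longer swallowed, so a persistent failure propagates to the caller.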
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanagerid","title":"BlockManagerId

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses a BlockManagerId for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#hostlocaldirmanager","title":"HostLocalDirManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager can use a HostLocalDirManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Default: (undefined)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockreplicationpolicy","title":"BlockReplicationPolicy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses a BlockReplicationPolicy for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#external-shuffle-services-port","title":"External Shuffle Service's Port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager determines the port of an external shuffle service when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The port is used to create the shuffleServerId and a HostLocalDirManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The port is also used for preferExecutors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#sparkdiskstoresubdirectories-configuration-property","title":"spark.diskStore.subDirectories Configuration Property

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses spark.diskStore.subDirectories configuration property to initialize a subDirsPerLocalDir local value.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  subDirsPerLocalDir is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • IndexShuffleBlockResolver is requested to getDataFile and getIndexFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to readDiskBlockFromSameHostExecutor
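A rough sketch of how subDirsPerLocalDir spreads block files across subdirectories (mirroring the hashing idea used by DiskBlockManager; the helper name and paths below are illustrative, not Spark's API):

```scala
// Illustrative only: maps a block file name to one of
// localDirs.length * subDirsPerLocalDir subdirectories.
def subDirFor(filename: String, localDirs: Array[String], subDirsPerLocalDir: Int): String = {
  val hash = filename.hashCode & Integer.MAX_VALUE          // non-negative hash
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
  f"${localDirs(dirId)}/$subDirId%02x/$filename"
}

subDirFor("shuffle_0_0_0.data", Array("/tmp/blockmgr-demo"), 64)
```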
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#fetching-block-or-computing-and-storing-it","title":"Fetching Block or Computing (and Storing) it
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getOrElseUpdate[T](\n  blockId: BlockId,\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Map.getOrElseUpdate

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I think it is fair to say that getOrElseUpdate is like getOrElseUpdate of scala.collection.mutable.Map in Scala.

getOrElseUpdate(key: K, op: => V): V\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Quoting the official scaladoc:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If given key K is already in this map, getOrElseUpdate returns the associated value V.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Otherwise, getOrElseUpdate computes a value V from given expression op, stores with the key K in the map and returns that value.

Since BlockManager is a key-value store of blocks of data identified by block IDs, the analogy fits well.
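For reference, this is how the Scala collection method behaves:

```scala
import scala.collection.mutable

val cache = mutable.Map.empty[String, Int]

// Key absent: the expression is evaluated, stored, and returned.
cache.getOrElseUpdate("answer", { println("computing..."); 42 })  // prints "computing...", yields 42

// Key present: the cached value is returned and the expression is not evaluated again.
cache.getOrElseUpdate("answer", { println("computing..."); 0 })   // yields 42, prints nothing
```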

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getOrElseUpdate first attempts to get the block by the BlockId (from the local block manager first and, if unavailable, requesting remote peers).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getOrElseUpdate gives the BlockResult of the block if found.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If however the block was not found (in any block manager in a Spark cluster), getOrElseUpdate doPutIterator (for the input BlockId, the makeIterator function and the StorageLevel).

getOrElseUpdate branches off per the result (see the sketch after the lists below):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • For None, getOrElseUpdate getLocalValues for the BlockId and eventually returns the BlockResult (unless terminated by a SparkException due to some internal error)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • For Some(iter), getOrElseUpdate returns an iterator of T values

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getOrElseUpdate is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • RDD is requested to get or compute an RDD partition (for an RDDBlockId with the RDD's id and partition index).
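A toy model of the Either-based contract sketched above (this is not the Spark implementation; ToyBlockStore, canStore and the other names are made up for illustration):

```scala
import scala.collection.mutable

// Left(result)  => the block is (now) available in the store
// Right(iter)   => the block could not be stored; use the freshly computed iterator
final case class ToyBlockResult[T](data: Iterator[T])

class ToyBlockStore[T] {
  private val store = mutable.Map.empty[String, Seq[T]]

  def getOrElseUpdate(
      blockId: String,
      canStore: Boolean,
      makeIterator: () => Iterator[T]): Either[ToyBlockResult[T], Iterator[T]] =
    store.get(blockId) match {
      case Some(values) =>
        Left(ToyBlockResult(values.iterator))        // found "locally"
      case None =>
        val computed = makeIterator()
        if (canStore) {                              // models a successful doPutIterator
          val values = computed.toSeq
          store(blockId) = values
          Left(ToyBlockResult(values.iterator))      // models the follow-up getLocalValues
        } else {
          Right(computed)                            // models Some(iter): could not be stored
        }
    }
}
```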
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#fetching-block","title":"Fetching Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  get[T: ClassTag](\n  blockId: BlockId): Option[BlockResult]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  get attempts to fetch the block (BlockId) from a local block manager first before requesting it from remote block managers. get returns a BlockResult or None (to denote \"a block is not available\").

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, get tries to fetch the block from the local BlockManager. If found, get prints out the following INFO message to the logs and returns a BlockResult.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Found block [blockId] locally\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If however the block was not found locally, get tries to fetch the block from remote BlockManagers. If fetched, get prints out the following INFO message to the logs and returns a BlockResult.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Found block [blockId] remotely\n
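BlockManager is an internal (developer) API, but a hedged sketch of calling get from code running inside a Spark application could look like this (the RDD and partition IDs are made up):

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{BlockId, RDDBlockId}

// For illustration only: access the current BlockManager via SparkEnv.
val blockManager = SparkEnv.get.blockManager
val blockId: BlockId = RDDBlockId(rddId = 0, splitIndex = 0)

blockManager.get[Any](blockId) match {
  case Some(blockResult) =>
    println(s"Found $blockId via ${blockResult.readMethod} (${blockResult.bytes} bytes)")
  case None =>
    println(s"$blockId is not available in any block manager")
}
```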
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#getremotevalues","title":"getRemoteValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getRemoteValues[T: ClassTag](\n  blockId: BlockId): Option[BlockResult]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getRemoteValues getRemoteBlock with the bufferTransformer function that takes a ManagedBuffer and does the following:

• Requests the SerializerManager to deserialize values from an input stream over the ManagedBuffer
• Creates a BlockResult with the values (along with their total size and the Network read method)
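A minimal sketch of such a values-oriented bufferTransformer, with stand-ins for the internal SerializerManager and BlockResult types:

```scala
import java.io.InputStream
import org.apache.spark.network.buffer.ManagedBuffer

object ValuesTransformerSketch {
  // Stand-in for org.apache.spark.storage.BlockResult (private[spark]).
  final case class BlockResultLike(values: Iterator[Any], readMethod: String, bytes: Long)

  // Stand-in for the SerializerManager's deserialization of a block's input stream.
  trait SerializerManagerLike {
    def dataDeserializeStream(blockId: String, in: InputStream): Iterator[Any]
  }

  def valuesTransformer(
      blockId: String,
      serializerManager: SerializerManagerLike)(data: ManagedBuffer): BlockResultLike = {
    // Deserialize the values from an input stream over the fetched ManagedBuffer...
    val values = serializerManager.dataDeserializeStream(blockId, data.createInputStream())
    // ...and wrap them with their total size and the Network read method.
    BlockResultLike(values, readMethod = "Network", bytes = data.size())
  }
}
```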
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#fetching-block-bytes-from-remote-block-managers","title":"Fetching Block Bytes From Remote Block Managers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getRemoteBytes(\n  blockId: BlockId): Option[ChunkedByteBuffer]\n

getRemoteBytes calls getRemoteBlock with a bufferTransformer function that takes a ManagedBuffer and creates a ChunkedByteBuffer.
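By contrast, the bytes-oriented transformer only wraps the raw bytes. A sketch, with a plain ByteBuffer standing in for the internal ChunkedByteBuffer:

```scala
import java.nio.ByteBuffer
import org.apache.spark.network.buffer.ManagedBuffer

object BytesTransformerSketch {
  // The real code wraps the buffer into a ChunkedByteBuffer (internal);
  // a plain ByteBuffer stands in for it here.
  def bytesTransformer(data: ManagedBuffer): ByteBuffer =
    data.nioByteBuffer()
}
```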

getRemoteBytes is used when:

• TorrentBroadcast is requested to readBlocks
• TaskResultGetter is requested to enqueueSuccessfulTask
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#fetching-remote-block","title":"Fetching Remote Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getRemoteBlock[T](\n  blockId: BlockId,\n  bufferTransformer: ManagedBuffer => T): Option[T]\n

getRemoteBlock is used for getRemoteValues and getRemoteBytes.

getRemoteBlock prints out the following DEBUG message to the logs:

```text
Getting remote block [blockId]
```

getRemoteBlock requests the BlockManagerMaster for the locations and status of the input BlockId (along with the host of the local BlockManagerId).

When locations are available, getRemoteBlock determines the size of the block (the maximum of diskSize and memSize) and tries to read the block from the local directories of another executor on the same host, printing out the following INFO message to the logs:

```text
Read [blockId] from the disk of a same host executor is [successful|failed].
```

When the block could not be found in any of the local directories, getRemoteBlock calls fetchRemoteManagedBuffer.

With no locations from the BlockManagerMaster, getRemoteBlock prints out a DEBUG message to the logs. The overall control flow is sketched below.
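A hedged sketch of this control flow (all helpers are hypothetical stand-ins passed in as functions; the real methods are internal to BlockManager):

```scala
object GetRemoteBlockSketch {
  // Stand-ins for the internal types mentioned above.
  final case class BlockStatusLike(diskSize: Long, memSize: Long)
  final case class LocationsAndStatusLike(
      status: BlockStatusLike,
      localDirs: Option[Array[String]]) // directories of a same-host executor, if known

  def getRemoteBlock[T](
      blockId: String,
      lookUpLocationsAndStatus: String => Option[LocationsAndStatusLike],
      readDiskBlockFromSameHostExecutor: (String, Array[String], Long) => Option[AnyRef],
      fetchRemoteManagedBuffer: (String, Long, LocationsAndStatusLike) => Option[AnyRef],
      bufferTransformer: AnyRef => T): Option[T] = {
    println(s"Getting remote block $blockId") // DEBUG in the real code
    lookUpLocationsAndStatus(blockId) match {
      case Some(locationsAndStatus) =>
        // Block size is the larger of the on-disk and in-memory sizes.
        val blockSize =
          locationsAndStatus.status.diskSize.max(locationsAndStatus.status.memSize)
        // Try the local directories of another executor on the same host first,
        // then fall back to fetching the block from a remote BlockManager.
        locationsAndStatus.localDirs
          .flatMap(dirs => readDiskBlockFromSameHostExecutor(blockId, dirs, blockSize))
          .orElse(fetchRemoteManagedBuffer(blockId, blockSize, locationsAndStatus))
          .map(bufferTransformer)
      case None =>
        None // the real code logs a DEBUG message when there are no locations
    }
  }
}
```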

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#readdiskblockfromsamehostexecutor","title":"readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readDiskBlockFromSameHostExecutor(\n  blockId: BlockId,\n  localDirs: Array[String],\n  blockSize: Long): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readDiskBlockFromSameHostExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#fetchremotemanagedbuffer","title":"fetchRemoteManagedBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  fetchRemoteManagedBuffer(\n  blockId: BlockId,\n  blockSize: Long,\n  locationsAndStatus: BlockManagerMessages.BlockLocationsAndStatus): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  fetchRemoteManagedBuffer...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#sortlocations","title":"sortLocations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sortLocations(\n  locations: Seq[BlockManagerId]): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  sortLocations...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#preferexecutors","title":"preferExecutors
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  preferExecutors(\n  locations: Seq[BlockManagerId]): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  preferExecutors...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#readdiskblockfromsamehostexecutor_1","title":"readDiskBlockFromSameHostExecutor
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readDiskBlockFromSameHostExecutor(\n  blockId: BlockId,\n  localDirs: Array[String],\n  blockSize: Long): Option[ManagedBuffer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readDiskBlockFromSameHostExecutor...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#executioncontextexecutorservice","title":"ExecutionContextExecutorService

BlockManager uses a Scala ExecutionContextExecutorService to execute FIXME asynchronously (on a thread pool with the block-manager-future prefix and a maximum of 128 threads).
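Such a thread pool could be built along these lines (a sketch only; the real code goes through Spark's internal thread utilities):

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}

object BlockManagerFuturePoolSketch {
  private val counter = new AtomicInteger(0)

  // Daemon threads named with the block-manager-future prefix.
  private val threadFactory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"block-manager-future-${counter.getAndIncrement()}")
      t.setDaemon(true)
      t
    }
  }

  // Capped at 128 threads (the real pool is a cached pool with 128 as its upper bound).
  val futureExecutionContext: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(128, threadFactory))
}
```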

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockevictionhandler","title":"BlockEvictionHandler

BlockManager is a BlockEvictionHandler that can drop a block from memory (and store it on disk when necessary).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#shuffleclient-and-external-shuffle-service","title":"ShuffleClient and External Shuffle Service

Danger

FIXME ShuffleClient and ExternalShuffleClient are dead. Long live BlockStoreClient and ExternalBlockStoreClient.

BlockManager manages the lifecycle of a ShuffleClient:

• Creates it when created
• Initializes it (and possibly registers it with an external shuffle server) when requested to initialize
• Closes it when requested to stop

The ShuffleClient is either an ExternalShuffleClient or the given BlockTransferService, depending on the spark.shuffle.service.enabled configuration property: when the property is enabled, BlockManager uses the ExternalShuffleClient.
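For reference, the property behind this choice is an ordinary Spark configuration property and can be set, for example, on a SparkConf (shown here as a spark-shell-style snippet; the application name is made up):

```scala
import org.apache.spark.SparkConf

// With the external shuffle service enabled, BlockManager uses the
// ExternalShuffleClient rather than the BlockTransferService it was given.
val conf = new SparkConf()
  .setAppName("shuffle-service-demo") // hypothetical application name
  .set("spark.shuffle.service.enabled", "true")
```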

The ShuffleClient is available to other Spark services (using the shuffleClient value) and is used when BlockStoreShuffleReader is requested to read combined key-value records for a reduce task.

When requested for shuffle metrics, BlockManager simply requests them from the ShuffleClient.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanager-and-rpcenv","title":"BlockManager and RpcEnv

BlockManager is given an RpcEnv when created.

The RpcEnv is used to set up a BlockManagerSlaveEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockinfomanager","title":"BlockInfoManager

BlockManager creates a BlockInfoManager when created.

BlockManager requests the BlockInfoManager to clear when requested to stop.

BlockManager uses the BlockInfoManager to create a MemoryStore.

BlockManager uses the BlockInfoManager when requested for the following:

• reportAllBlocks
• getStatus
• getMatchingBlockIds
• getLocalValues and getLocalBytes
• doPut
• replicateBlock
• dropFromMemory
• removeRdd, removeBroadcast, removeBlock, removeBlockInternal
• downgradeLock, releaseLock, registerTask, releaseAllLocksForTask

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanager-and-blockmanagermaster","title":"BlockManager and BlockManagerMaster

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is given a BlockManagerMaster when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanager-as-blockdatamanager","title":"BlockManager as BlockDataManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is a BlockDataManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanager-and-mapoutputtracker","title":"BlockManager and MapOutputTracker

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is given a MapOutputTracker when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#executor-id","title":"Executor ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is given an Executor ID when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The Executor ID is one of the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • driver (SparkContext.DRIVER_IDENTIFIER) for the driver

• The value of the --executor-id command-line argument for CoarseGrainedExecutorBackend executors
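The executor ID ends up in this BlockManager's BlockManagerId and can be inspected from spark-shell, following the same SparkEnv access pattern used elsewhere on this page (a minimal illustration that assumes an active SparkContext):

import org.apache.spark.SparkEnv

// Assumes an active SparkContext so that SparkEnv (and its BlockManager) is initialized.
val executorId = SparkEnv.get.blockManager.blockManagerId.executorId

// Prints "driver" on the driver, or the --executor-id value on a CoarseGrainedExecutorBackend executor.
println(s"This JVM's BlockManager belongs to executor: $executorId")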

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockmanagerendpoint-rpc-endpoint","title":"BlockManagerEndpoint RPC Endpoint

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager requests the RpcEnv to register a BlockManagerSlaveEndpoint under the name BlockManagerEndpoint[ID].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The RPC endpoint is used when BlockManager is requested to initialize and reregister (to register the BlockManager on an executor with the BlockManagerMaster on the driver).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The endpoint is stopped (by requesting the RpcEnv to stop the reference) when BlockManager is requested to stop.
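For orientation, the registration boils down to a single RpcEnv.setupEndpoint call in the BlockManager constructor. The following is a paraphrased sketch (field and class names follow this page and vary across Spark versions, so treat it as illustrative rather than copy-paste runnable):

// Paraphrased from the BlockManager constructor (not standalone-runnable):
// registers a BlockManagerSlaveEndpoint under a unique BlockManagerEndpoint[ID] name.
private val slaveEndpoint = rpcEnv.setupEndpoint(
  "BlockManagerEndpoint" + BlockManager.ID_GENERATOR.next,
  new BlockManagerSlaveEndpoint(rpcEnv, this, mapOutputTracker))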

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#accessing-blockmanager","title":"Accessing BlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is available using SparkEnv on the driver and executors.

import org.apache.spark.SparkEnv
val bm = SparkEnv.get.blockManager

scala> :type bm
org.apache.spark.storage.BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockstoreclient","title":"BlockStoreClient

BlockManager uses a BlockStoreClient to read blocks of other executors: an ExternalBlockStoreClient (when one is given and the external shuffle service is enabled) or the BlockTransferService (to connect directly to other executors), as sketched after the list below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  This BlockStoreClient is used when:

• BlockStoreShuffleReader is requested to read combined key-values for a reduce task
• BlockManager is requested to create the HostLocalDirManager (at initialization)
• BlockManager is requested for the shuffleMetricsSource
• BlockManager is requested to registerWithExternalShuffleServer (when an external shuffle server is used and the ExternalBlockStoreClient is defined)
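The choice between the two is a simple fallback, roughly the following one-liner paraphrased from the BlockManager sources (treat the exact field names as an assumption):

// The external shuffle-service client wins when it is given; otherwise fall back to the
// BlockTransferService for direct executor-to-executor reads.
private[spark] val blockStoreClient = externalBlockStoreClient.getOrElse(blockTransferService)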
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blocktransferservice","title":"BlockTransferService

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is given a BlockTransferService when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Note

There is only one concrete BlockTransferService, NettyBlockTransferService, and there seems to be no way to reconfigure Apache Spark to use a different implementation (if there were any).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockTransferService is used when BlockManager is requested to fetch a block from and replicate a block to remote block managers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockTransferService is used as the BlockStoreClient (unless an ExternalBlockStoreClient is specified).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockTransferService is initialized with this BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockTransferService is closed when BlockManager is requested to stop.
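The lifecycle of the BlockTransferService mirrors that of the owning BlockManager. A paraphrased sketch of the two calls (method names as in the Spark sources, shown only for orientation):

// In BlockManager.initialize: wire the transfer service to this BlockManager
// (which, as a BlockDataManager, serves the blocks being transferred).
blockTransferService.init(this)

// In BlockManager.stop: shut the transfer service down together with the BlockManager.
blockTransferService.close()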

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#shufflemanager","title":"ShuffleManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager is given a ShuffleManager when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses the ShuffleManager for the following:

• Retrieving block data (for shuffle blocks)

• Retrieving non-shuffle block data (for shuffle blocks anyway)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Registering an executor with a local external shuffle service (when initialized on an executor with externalShuffleServiceEnabled)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#memorystore","title":"MemoryStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager creates a MemoryStore when created (with the BlockInfoManager, the SerializerManager, the MemoryManager and itself as a BlockEvictionHandler).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager requests the MemoryManager to use the MemoryStore.
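Put together, the wiring looks roughly like the following paraphrase of the BlockManager sources (field names are assumptions based on the description above):

// The BlockManager itself acts as the BlockEvictionHandler (the last argument).
private[spark] val memoryStore =
  new MemoryStore(conf, blockInfoManager, serializerManager, memoryManager, this)

// Tell the MemoryManager which MemoryStore to use when blocks have to be evicted.
memoryManager.setMemoryStore(memoryStore)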

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses the MemoryStore for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • getStatus and getCurrentBlockStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • getLocalValues

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • doGetLocalBytes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • doPutBytes and doPutIterator

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • maybeCacheDiskBytesInMemory and maybeCacheDiskValuesInMemory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • dropFromMemory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • removeBlockInternal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The MemoryStore is requested to clear when BlockManager is requested to stop.

The MemoryStore is available to other Spark services as the private memoryStore reference.

import org.apache.spark.SparkEnv
SparkEnv.get.blockManager.memoryStore

The MemoryStore is also used (via the SparkEnv.get.blockManager.memoryStore reference) when Task is requested to run: once a task has finished executing, it requests the MemoryStore to release its unroll memory.
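A paraphrased sketch of that clean-up step at the end of Task.run (the MemoryMode values come from the Spark sources; the exact placement is an assumption):

import org.apache.spark.memory.MemoryMode

// Give back any unroll memory this task still holds, for both memory modes.
SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP)
SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.OFF_HEAP)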

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#diskstore","title":"DiskStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager creates a DiskStore (with the DiskBlockManager) when created.
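As a paraphrased sketch of the construction (the SecurityManager argument is an assumption based on recent Spark versions, where it enables optional I/O encryption of on-disk data):

// The DiskStore relies on the DiskBlockManager to map blocks to files on the local disks.
val diskStore = new DiskStore(conf, diskBlockManager, securityManager)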

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses the DiskStore when requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • getStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • getCurrentBlockStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • getLocalValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • doPutIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • removeBlockInternal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DiskStore is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ByteBufferBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TempFileBasedBlockStoreUpdater is requested to blockData and saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#performance-metrics","title":"Performance Metrics

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockManager uses BlockManagerSource to report metrics under the name BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#getpeers","title":"getPeers
getPeers(
  forceFetch: Boolean): Seq[BlockManagerId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPeers...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getPeers is used when BlockManager is requested to replicateBlock and replicate.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#releasing-all-locks-for-task","title":"Releasing All Locks For Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  releaseAllLocksForTask(\n  taskAttemptId: Long): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  releaseAllLocksForTask...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  releaseAllLocksForTask is used when TaskRunner is requested to run (at the end of a task).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#stopping-blockmanager","title":"Stopping BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  stop is used when SparkEnv is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#getting-ids-of-existing-blocks-for-a-given-filter","title":"Getting IDs of Existing Blocks (For a Given Filter)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getMatchingBlockIds(\n  filter: BlockId => Boolean): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getMatchingBlockIds...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getMatchingBlockIds is used when BlockManagerSlaveEndpoint is requested to handle a GetMatchingBlockIds message.
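
The filter is a plain BlockId => Boolean predicate, so a caller could, for example, collect the IDs of all blocks of a given (hypothetical) RDD. A minimal sketch, assuming code compiled in the org.apache.spark package (BlockManager is private[spark]) and a running SparkEnv:

import org.apache.spark.SparkEnv\nimport org.apache.spark.storage.{BlockId, RDDBlockId}\n\n// Hypothetical example: the IDs of all blocks of RDD 42 known to this BlockManager\nval rdd42Blocks: Seq[BlockId] = SparkEnv.get.blockManager.getMatchingBlockIds {\n  case RDDBlockId(rddId, _) => rddId == 42\n  case _                    => false\n}\n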

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#getting-local-block","title":"Getting Local Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalValues(\n  blockId: BlockId): Option[BlockResult]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalValues prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Getting local block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalValues obtains a read lock for blockId.

When the blockId block is not found, you should see the following DEBUG message in the logs and getLocalValues returns None.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Block [blockId] was not found\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  When the blockId block was found, you should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Level for block [blockId] is [level]\n

If the blockId block uses a memory storage level and is registered in the MemoryStore, getLocalValues returns a BlockResult (with the Memory read method) with a CompletionIterator over one of the following iterators:

1. The values iterator from the MemoryStore for the blockId block (for \"deserialized\" persistence levels).
2. An iterator from the SerializerManager after the data stream of the blockId block's bytes (from the MemoryStore) has been deserialized (for \"serialized\" persistence levels).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalValues is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TorrentBroadcast is requested to readBroadcastBlock

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to get and getOrElseUpdate
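
As a rough illustration of the read path above, the following sketch reads a block's values locally and falls back when the block is not found. It assumes code in the org.apache.spark package, a hypothetical already-cached RDD block, and that BlockResult exposes data, readMethod and bytes members:

import org.apache.spark.SparkEnv\nimport org.apache.spark.storage.RDDBlockId\n\nval blockId = RDDBlockId(0, 0)  // hypothetical: partition 0 of a cached RDD 0\nSparkEnv.get.blockManager.getLocalValues(blockId) match {\n  case Some(result) =>\n    // readMethod tells whether the block was read from memory or disk\n    println(s\"Read ${result.bytes} bytes via ${result.readMethod}\")\n    result.data.foreach(println)  // the block's values\n  case None =>\n    println(s\"Block $blockId was not found locally\")\n}\n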

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#maybecachediskvaluesinmemory","title":"maybeCacheDiskValuesInMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  maybeCacheDiskValuesInMemory[T](\n  blockInfo: BlockInfo,\n  blockId: BlockId,\n  level: StorageLevel,\n  diskIterator: Iterator[T]): Iterator[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  maybeCacheDiskValuesInMemory...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#retrieving-block-data","title":"Retrieving Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getBlockData(\n  blockId: BlockId): ManagedBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getBlockData is part of the BlockDataManager abstraction.

For a shuffle block (a ShuffleBlockId), getBlockData requests the ShuffleManager for the ShuffleBlockResolver that is then requested for the block data (getBlockData).

Otherwise, getBlockData retrieves the non-shuffle local block data (getLocalBytes) for the given BlockId.

If found, getBlockData creates a new BlockManagerManagedBuffer (with the BlockInfoManager, the input BlockId, the retrieved BlockData and the dispose flag enabled).

If not found, getBlockData informs the BlockManagerMaster that the block could not be found (and that the master should no longer assume the block is available on this executor) and throws a BlockNotFoundException.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: getBlockData is executed for shuffle blocks or local blocks that the BlockManagerMaster knows this executor really has (unless BlockManagerMaster is outdated).
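
The branching can be modeled with the following self-contained sketch. It is not the actual BlockManager code; the stand-in parameters are hypothetical and merely mirror the ShuffleBlockResolver lookup, getLocalBytes and the status report to the master described above:

// A model of the branching above; the stand-in parameters are hypothetical\ndef serveBlock[A](\n    isShuffle: Boolean,\n    shuffleLookup: () => A,          // ShuffleBlockResolver.getBlockData\n    localLookup: () => Option[A],    // getLocalBytes\n    reportMissing: () => Unit): A =  // tell the master the block is gone\n  if (isShuffle) {\n    shuffleLookup()\n  } else {\n    localLookup().getOrElse {\n      reportMissing()\n      throw new RuntimeException(\"block not found\")  // BlockNotFoundException in Spark\n    }\n  }\n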

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#retrieving-non-shuffle-local-block-data","title":"Retrieving Non-Shuffle Local Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalBytes(\n  blockId: BlockId): Option[BlockData]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getLocalBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TorrentBroadcast is requested to readBlocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested for the block data (of a non-shuffle block)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#storing-block-data-locally","title":"Storing Block Data Locally
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  putBlockData(\n  blockId: BlockId,\n  data: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  putBlockData is part of the BlockDataManager abstraction.

putBlockData stores the given ManagedBuffer locally using putBytes with the buffer's Java NIO ByteBuffer.
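
In other words, the method body boils down to roughly the following one-liner (a sketch based on the sentence above, not a verbatim copy of the source):

// Sketch: wrap the ManagedBuffer's NIO buffer and delegate to putBytes\nputBytes(blockId, new ChunkedByteBuffer(data.nioByteBuffer()), level)(classTag)\n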

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#storing-block-bytebuffer-locally","title":"Storing Block (ByteBuffer) Locally
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  putBytes(\n  blockId: BlockId,\n  bytes: ChunkedByteBuffer,\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  putBytes creates a ByteBufferBlockStoreUpdater that is then requested to store the bytes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  putBytes is used when:

• BlockManager is requested to put block data locally
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskRunner is requested to run (and the result size is above maxDirectResultSize)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TorrentBroadcast is requested to writeBlocks and readBlocks
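
A hypothetical usage sketch, assuming code in the org.apache.spark package, a running SparkEnv, and Spark's test-oriented TestBlockId:

import java.nio.ByteBuffer\nimport org.apache.spark.SparkEnv\nimport org.apache.spark.storage.{StorageLevel, TestBlockId}\nimport org.apache.spark.util.io.ChunkedByteBuffer\n\n// Hypothetical example: store a small payload as a block on this BlockManager\nval payload = new ChunkedByteBuffer(ByteBuffer.wrap(\"hello, block\".getBytes(\"UTF-8\")))\nval stored = SparkEnv.get.blockManager.putBytes(\n  TestBlockId(\"demo\"), payload, StorageLevel.MEMORY_AND_DISK)\nassert(stored, \"the block should have been stored\")\n
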
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#doputbytes","title":"doPutBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  doPutBytes[T](\n  blockId: BlockId,\n  bytes: ChunkedByteBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  tellMaster: Boolean = true,\n  keepReadLock: Boolean = false): Boolean\n

doPutBytes calls the internal helper doPut with a putBody function that accepts a BlockInfo and does the uploading.

Inside the function, if the storage level's replication is greater than 1, it immediately starts replication of the blockId block on a separate thread (from the futureExecutionContext thread pool). The replication uses the input bytes and the level storage level.

For a memory storage level, the function checks whether the storage level is deserialized or not. For a deserialized storage level, the BlockManager's SerializerManager deserializes the bytes into an iterator of values that the MemoryStore stores (putIteratorAsValues). If, however, the storage level is not deserialized, the function requests the MemoryStore to store the bytes (putBytes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If the put did not succeed and the storage level is to use disk, you should see the following WARN message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Persisting block [blockId] to disk instead.\n

And the DiskStore stores the bytes (putBytes).

NOTE: The DiskStore is requested to store the bytes of a block with a memory-and-disk storage level only when the MemoryStore has failed.

If the storage level is to use disk only, the DiskStore stores the bytes (putBytes).

doPutBytes requests the current block status and, if the block was successfully stored and the driver should know about it (tellMaster), the function reports the block status to the BlockManagerMaster. The current TaskContext metrics are updated with the updated block status (only when executed inside a task where a TaskContext is available).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Put block [blockId] locally took [time] ms\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The function waits till the earlier asynchronous replication finishes for a block with replication level greater than 1.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The final result of doPutBytes is the result of storing the block successful or not (as computed earlier).
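
As an illustration (not in the original text), a user-level way to exercise these code paths is to persist an RDD with a disk-backed storage level; a minimal spark-shell sketch, assuming an active SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK: blocks go to the MemoryStore first and fall back to the DiskStore
// only when they do not fit in memory (the fallback described above).
val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   // materializes and stores the blocks

// DISK_ONLY: the MemoryStore is skipped and the DiskStore stores the bytes directly.
val onDisk = sc.parallelize(1 to 1000000).persist(StorageLevel.DISK_ONLY)
onDisk.count()
```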

NOTE: doPutBytes is used exclusively when BlockManager is requested to store the bytes of a block (putBytes).

## Putting New Block

```scala
doPut[T](
  blockId: BlockId,
  level: StorageLevel,
  classTag: ClassTag[_],
  tellMaster: Boolean,
  keepReadLock: Boolean)(putBody: BlockInfo => Option[T]): Option[T]
```

doPut requires that the given StorageLevel is valid.

doPut creates a new BlockInfo and requests the BlockInfoManager for a write lock on the block.

doPut executes the given putBody function (with the BlockInfo).

If the putBody function returns None, the block is considered saved successfully.

For a successful save, doPut requests the BlockInfoManager to downgradeLock or unlock the block, based on the given keepReadLock flag (true and false, respectively).

For an unsuccessful save (when putBody returns some value), doPut removes the block (removeBlockInternal) and prints out the following WARN message to the logs:

```text
Putting block [blockId] failed
```

In the end, doPut prints out the following DEBUG message to the logs:

```text
Putting block [blockId] [withOrWithout] replication took [usedTime] ms
```
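
The control flow above boils down to a lock-execute-cleanup pattern. The following is a much-simplified, self-contained sketch (not Spark's actual code; all helpers below are stand-ins for BlockInfoManager and BlockManager internals) that mirrors how the putBody result decides between unlocking and removing the block:

```scala
object DoPutSketch {
  // Stand-ins for BlockInfoManager / BlockManager internals (illustration only)
  private def lockForWriting(blockId: String): Unit = println(s"write lock on $blockId")
  private def downgradeLock(blockId: String): Unit = println(s"downgrade to read lock on $blockId")
  private def unlock(blockId: String): Unit = println(s"release lock on $blockId")
  private def removeBlockInternal(blockId: String): Unit = println(s"remove $blockId")

  def doPut[T](blockId: String, keepReadLock: Boolean)(putBody: => Option[T]): Option[T] = {
    lockForWriting(blockId)   // write lock for the new block
    val result = putBody      // attempt to store the block
    if (result.isEmpty) {
      // None = saved successfully: keep a read lock or release the lock entirely
      if (keepReadLock) downgradeLock(blockId) else unlock(blockId)
    } else {
      // Some(_) = the block could not be stored: clean up and warn
      removeBlockInternal(blockId)
      println(s"Putting block $blockId failed")
    }
    result
  }

  def main(args: Array[String]): Unit = {
    doPut("rdd_0_0", keepReadLock = false)(None)          // successful save
    doPut("rdd_0_1", keepReadLock = true)(Some("error"))  // failed save
  }
}
```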

doPut is used when:

* BlockStoreUpdater is requested to save
* BlockManager is requested to doPutIterator

## Removing Block

```scala
removeBlock(
  blockId: BlockId,
  tellMaster: Boolean = true): Unit
```

removeBlock prints out the following DEBUG message to the logs:

```text
Removing block [blockId]
```

removeBlock requests the BlockInfoManager for a write lock on the block.

With the write lock acquired, removeBlock removes the block (removeBlockInternal) with the tellMaster flag turned on only when both the input tellMaster flag and the tellMaster flag of the block itself are turned on.

In the end, removeBlock updates the task metrics (addUpdatedBlockStatusToTaskMetrics) with an empty BlockStatus.

In case the block is no longer available (None), removeBlock prints out the following WARN message to the logs:

```text
Asked to remove block [blockId], which does not exist
```
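
Since BlockManager is a private[spark] API, removeBlock can only be invoked from Spark-internal code. A hypothetical sketch (the package and object names below are made up for illustration) of removing the block of one cached RDD partition:

```scala
package org.apache.spark.sketches  // private[spark] members are visible from org.apache.spark.* packages

import org.apache.spark.SparkEnv
import org.apache.spark.storage.RDDBlockId

object RemoveBlockSketch {
  // Removes one cached RDD partition's block from this node's BlockManager and,
  // with tellMaster enabled, informs the driver about the removal.
  def removeCachedPartition(rddId: Int, partition: Int): Unit = {
    SparkEnv.get.blockManager.removeBlock(RDDBlockId(rddId, partition), tellMaster = true)
  }
}
```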

removeBlock is used when:

* BlockManager is requested to handleLocalReadFailure, removeRdd, removeBroadcast
* BlockManagerDecommissioner is requested to migrate a block
* BlockManagerStorageEndpoint is requested to handle a RemoveBlock message

## Removing RDD Blocks

```scala
removeRdd(
  rddId: Int): Int
```

removeRdd removes all the blocks that belong to the rddId RDD.

It prints out the following INFO message to the logs:

```text
Removing RDD [rddId]
```

It then requests the RDD blocks from the [BlockInfoManager](BlockInfoManager.md) and removes them (without informing the driver).

The number of blocks removed is the final result.

NOTE: removeRdd is used by [BlockManagerSlaveEndpoint while handling RemoveRdd messages](BlockManagerSlaveEndpoint.md#RemoveRdd).
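
For reference (not in the original text), the usual user-level trigger of this path is unpersisting a cached RDD, which makes the driver send RemoveRdd messages to the executors. A minimal spark-shell sketch, assuming an active SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100).cache()
rdd.count()                     // materializes and caches the RDD blocks
rdd.unpersist(blocking = true)  // drives the RemoveRdd / removeRdd path described above
```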

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#removing-all-blocks-of-broadcast-variable","title":"Removing All Blocks of Broadcast Variable
```scala
removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int
```

removeBroadcast removes all the blocks of the given broadcastId broadcast variable.

Internally, it starts by printing out the following DEBUG message to the logs:

```text
Removing broadcast [broadcastId]
```

It then requests all the [BroadcastBlockId](BlockId.md#BroadcastBlockId) objects that belong to the broadcastId broadcast from the [BlockInfoManager](BlockInfoManager.md) and removes them.

The number of blocks removed is the final result.

NOTE: removeBroadcast is used by [BlockManagerSlaveEndpoint while handling RemoveBroadcast messages](BlockManagerSlaveEndpoint.md#RemoveBroadcast).
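
As an aside (not in the original text), removing broadcast blocks is typically triggered from user code by unpersisting or destroying the broadcast variable; a minimal spark-shell sketch, assuming an active SparkContext `sc`:

```scala
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
sc.parallelize(Seq("a", "b")).map(k => lookup.value.getOrElse(k, 0)).collect()

lookup.unpersist(blocking = true)  // removes the broadcast blocks from the executors
lookup.destroy()                   // also removes the driver-side copy; the variable cannot be used afterwards
```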

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#external-shuffle-servers-address","title":"External Shuffle Server's Address
```scala
shuffleServerId: BlockManagerId
```

When requested to initialize, BlockManager records the location (BlockManagerId) of the External Shuffle Service if enabled (see the configuration sketch after the list below) or simply uses the non-external-shuffle-service BlockManagerId.

The BlockManagerId is used to register an executor with a local external shuffle service.

The BlockManagerId is used as the location of a shuffle map output when:

* BypassMergeSortShuffleWriter is requested to write partition records to a shuffle file
* UnsafeShuffleWriter is requested to close and write output
* SortShuffleWriter is requested to write output
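
A hedged configuration sketch (the property names are standard Spark configuration; the values are examples): with the external shuffle service enabled, shuffleServerId points at the service's host and port rather than at the BlockManager itself.

```scala
import org.apache.spark.SparkConf

// Example settings that make BlockManager record the external shuffle service
// as shuffleServerId during initialization.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")  // enable the External Shuffle Service
  .set("spark.shuffle.service.port", "7337")     // 7337 is the default port
```
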
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#getstatus","title":"getStatus
```scala
getStatus(
  blockId: BlockId): Option[BlockStatus]
```

getStatus...FIXME

getStatus is used when BlockManagerSlaveEndpoint is requested to handle a GetBlockStatus message.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#re-registering-blockmanager-with-driver","title":"Re-registering BlockManager with Driver
```scala
reregister(): Unit
```

reregister prints out the following INFO message to the logs:

```text
BlockManager [blockManagerId] re-registering with master
```

reregister requests the BlockManagerMaster to register this BlockManager.

In the end, reregister reports all the blocks (reportAllBlocks).

reregister is used when:

* Executor is requested to reportHeartBeat (and informed to re-register)
* BlockManager is requested to asyncReregister

## Reporting All Blocks

```scala
reportAllBlocks(): Unit
```

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  reportAllBlocks prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Reporting [n] blocks to the master.\n

For every block in the BlockInfoManager, reportAllBlocks calculates the current block status (getCurrentBlockStatus) and, for blocks tracked by the master, reports it to the master (tryToReportBlockStatus).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  reportAllBlocks prints out the following ERROR message to the logs and exits when block status reporting fails for any block:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Failed to report [blockId] to master; giving up.\n
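
The loop can be sketched as follows (a sketch only, not the exact sources; blockInfoManager.entries and info.tellMaster are assumed names for the tracked blocks and the tracked-by-master flag):

for ((blockId, info) <- blockInfoManager.entries) {\n  val status = getCurrentBlockStatus(blockId, info)  // compute the current status\n  if (info.tellMaster && !tryToReportBlockStatus(blockId, status)) {\n    logError(s"Failed to report $blockId to master; giving up.")\n    return  // give up on the first failed report\n  }\n}\n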
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#calculate-current-block-status","title":"Calculate Current Block Status
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  getCurrentBlockStatus(\n  blockId: BlockId,\n  info: BlockInfo): BlockStatus\n

getCurrentBlockStatus gives the current BlockStatus of the given block (with the block's current StorageLevel and memory and disk sizes). It uses the MemoryStore and the DiskStore for sizes and other information.

NOTE: Most of the information needed to build BlockStatus is already in the BlockInfo, but it may not necessarily reflect the current state in the MemoryStore and the DiskStore.

Internally, getCurrentBlockStatus uses the input BlockInfo to find out the block's storage level. If the storage level is not set (i.e. null), the returned BlockStatus assumes the default NONE storage level and memory and disk sizes of 0.

If the storage level is set, getCurrentBlockStatus uses the MemoryStore and the DiskStore to check whether the block is actually stored there and, if so, requests the respective sizes (using their getSize, or assumes 0 otherwise).

NOTE: It is acceptable that the BlockInfo says to use memory or disk while the block is not in those stores (yet or anymore). getCurrentBlockStatus gives the current status.
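
Condensed into a sketch (assuming memoryStore and diskStore expose contains and getSize, and a StorageLevel factory with the parameters shown; these are assumptions, not the verbatim sources):

val level = info.level\nif (level == null) {\n  BlockStatus.empty  // StorageLevel.NONE, 0 memory and disk sizes\n} else {\n  val inMem = level.useMemory && memoryStore.contains(blockId)\n  val onDisk = level.useDisk && diskStore.contains(blockId)\n  val memSize = if (inMem) memoryStore.getSize(blockId) else 0L\n  val diskSize = if (onDisk) diskStore.getSize(blockId) else 0L\n  BlockStatus(StorageLevel(onDisk, inMem, level.useOffHeap, level.deserialized, level.replication), memSize, diskSize)\n}\n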

getCurrentBlockStatus is used when BlockManager is requested to reportAllBlocks, among other block status updates.
","text":""},{"location":"storage/BlockManager/#reporting-current-storage-status-of-block-to-driver","title":"Reporting Current Storage Status of Block to Driver

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  reportBlockStatus(\n  blockId: BlockId,\n  status: BlockStatus,\n  droppedMemorySize: Long = 0L): Unit\n

reportBlockStatus reports the current status of the given block to the driver (the BlockManagerMaster) using tryToReportBlockStatus.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  If told to re-register, reportBlockStatus prints out the following INFO message to the logs followed by asynchronous re-registration:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Got told to re-register updating block [blockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  In the end, reportBlockStatus prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Told master about block [blockId]\n
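
The whole flow can be sketched as follows (a sketch; asyncReregister is the asynchronous re-registration mentioned above):

val needReregister = !tryToReportBlockStatus(blockId, status, droppedMemorySize)\nif (needReregister) {\n  logInfo(s"Got told to re-register updating block $blockId")\n  asyncReregister()  // re-register asynchronously\n}\nlogDebug(s"Told master about block $blockId")\n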

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  reportBlockStatus is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • IndexShuffleBlockResolver is requested to
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to getLocalBlockData, doPutIterator, dropFromMemory, removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#reporting-block-status-update-to-driver","title":"Reporting Block Status Update to Driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  tryToReportBlockStatus(\n  blockId: BlockId,\n  status: BlockStatus,\n  droppedMemorySize: Long = 0L): Boolean\n

tryToReportBlockStatus reports the given block status update to the BlockManagerMaster and returns its response.
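
As a sketch (master is the BlockManagerMaster; the exact updateBlockInfo parameters are an assumption):

val inMemSize = math.max(status.memSize, droppedMemorySize)\n// true when the master acknowledged the block (or it need not be registered)\nmaster.updateBlockInfo(blockManagerId, blockId, status.storageLevel, inMemSize, status.diskSize)\n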

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  tryToReportBlockStatus is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager is requested to reportAllBlocks, reportBlockStatus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#execution-context","title":"Execution Context

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  block-manager-future is the execution context for...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#bytebuffer","title":"ByteBuffer

The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB (Integer.MAX_VALUE; see Why does FileChannel.map take up to Integer.MAX_VALUE of data? and SPARK-1476 2GB limit in spark for blocks). This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB even though the API allows for long) and for serialization/deserialization via byte array-backed output streams.
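
For illustration only (the constant check and the message are not from the Spark sources), a size guard against that limit could look like this:

// a single ByteBuffer-backed block cannot exceed Int.MaxValue (~2GB)\nrequire(blockSize <= Int.MaxValue.toLong,\n  s"Block of $blockSize bytes exceeds the 2GB ByteBuffer limit")\n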

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/BlockManager/#blockresult","title":"BlockResult

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  BlockResult is a metadata of a fetched block:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Data (Iterator[Any])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • DataReadMethod
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Size (bytes)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockResult is created and returned when BlockManager is requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • getOrElseUpdate
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • get
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • getLocalValues
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • getRemoteValues
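
Given the fields above, BlockResult can be pictured as a simple holder (a sketch; consult the Spark sources for the exact definition):

class BlockResult(\n  val data: Iterator[Any],  // the block's values\n  val readMethod: DataReadMethod.Value,  // how the data was read\n  val bytes: Long)  // size in bytes\n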
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#datareadmethod","title":"DataReadMethod

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DataReadMethod describes how block data was read.
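
A sketch of the enumeration with the values listed below (assumed to mirror the actual DataReadMethod object):

object DataReadMethod extends Enumeration {\n  val Memory, Disk, Hadoop, Network = Value\n}\n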

The values and their sources:

• Disk: DiskStore (while getLocalValues)
• Hadoop: seems unused
• Memory: MemoryStore (while getLocalValues)
• Network: Remote BlockManagers (over the network)

","text":""},{"location":"storage/BlockManager/#registering-task","title":"Registering Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    registerTask(\n  taskAttemptId: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    registerTask requests the BlockInfoManager to register a given task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    registerTask is used when Task is requested to run (at the start of a task).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#creating-diskblockobjectwriter","title":"Creating DiskBlockObjectWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getDiskWriter(\n  blockId: BlockId,\n  file: File,\n  serializerInstance: SerializerInstance,\n  bufferSize: Int,\n  writeMetrics: ShuffleWriteMetrics): DiskBlockObjectWriter\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getDiskWriter creates a DiskBlockObjectWriter (with spark.shuffle.sync configuration property for syncWrites argument).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getDiskWriter uses the SerializerManager.
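
The wiring can be sketched as follows (a sketch; the DiskBlockObjectWriter argument order and the conf lookup are assumptions):

val syncWrites = conf.getBoolean("spark.shuffle.sync", false)\nnew DiskBlockObjectWriter(\n  file, serializerManager, serializerInstance, bufferSize, syncWrites, writeMetrics, blockId)\n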

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getDiskWriter is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BypassMergeSortShuffleWriter is requested to write records (of a partition)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExternalSorter is requested to writeSortedFile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to spillMemoryIteratorToDisk and writePartitionedFile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeSorterSpillWriter is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#recording-updated-blockstatus-in-taskmetrics-of-current-task","title":"Recording Updated BlockStatus in TaskMetrics (of Current Task)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    addUpdatedBlockStatusToTaskMetrics(\n  blockId: BlockId,\n  status: BlockStatus): Unit\n

addUpdatedBlockStatusToTaskMetrics takes an active TaskContext (if available) and records the updated BlockStatus of the block (in the task's TaskMetrics).

addUpdatedBlockStatusToTaskMetrics is used when BlockManager is requested to doPutBytes (for a block that was successfully stored), doPut, doPutIterator, drop a block from memory (possibly spilling it to disk), and remove a block from memory and disk.
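
A sketch of the recording step (assuming TaskMetrics exposes an updater such as incUpdatedBlockStatuses):

Option(TaskContext.get()).foreach { taskContext =>\n  // no-op when there is no active task\n  taskContext.taskMetrics().incUpdatedBlockStatuses(blockId -> status)\n}\n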

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#shuffle-metrics-source","title":"Shuffle Metrics Source
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleMetricsSource: Source\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleMetricsSource creates a ShuffleMetricsSource with the shuffleMetrics (of the BlockStoreClient) and the source name as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalShuffle when ExternalBlockStoreClient is specified
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • NettyBlockTransfer otherwise
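
The selection can be sketched as follows (a sketch; treat the ShuffleMetricsSource constructor and shuffleMetrics() as assumptions based on the description above):

val sourceName = blockStoreClient match {\n  case _: ExternalBlockStoreClient => "ExternalShuffle"\n  case _ => "NettyBlockTransfer"\n}\nnew ShuffleMetricsSource(sourceName, blockStoreClient.shuffleMetrics())\n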

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleMetricsSource is available using SparkEnv:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    env.blockManager.shuffleMetricsSource\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    shuffleMetricsSource is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Executor is created (for non-local / cluster modes)
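The source is normally registered with the metrics system when an Executor is created, but it can also be read directly. Below is a minimal sketch, not taken from the text above: it assumes a running Spark application and that the code lives in the org.apache.spark package, since BlockManager is an internal API.

```scala
// A minimal sketch (assumption): reading the shuffle metrics source of the current
// BlockManager through SparkEnv, from inside the org.apache.spark package.
package org.apache.spark

import org.apache.spark.metrics.source.Source

object ShuffleMetricsSourceDemo {
  def show(): Unit = {
    val source: Source = SparkEnv.get.blockManager.shuffleMetricsSource
    // sourceName is ExternalShuffle when an ExternalBlockStoreClient is in use,
    // NettyBlockTransfer otherwise (as listed above).
    println(source.sourceName)
  }
}
```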
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#replicating-block-to-peers","title":"Replicating Block To Peers
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicate(\n  blockId: BlockId,\n  data: BlockData,\n  level: StorageLevel,\n  classTag: ClassTag[_],\n  existingReplicas: Set[BlockManagerId] = Set.empty): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicate...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicate is used when BlockManager is requested to doPutBytes, doPutIterator and replicateBlock.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#replicateblock","title":"replicateBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicateBlock(\n  blockId: BlockId,\n  existingReplicas: Set[BlockManagerId],\n  maxReplicas: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicateBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    replicateBlock is used when BlockManagerSlaveEndpoint is requested to handle a ReplicateBlock message.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#putiterator","title":"putIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putIterator[T: ClassTag](\n  blockId: BlockId,\n  values: Iterator[T],\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putIterator...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putIterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to putSingle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#putsingle","title":"putSingle
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putSingle[T: ClassTag](\n  blockId: BlockId,\n  value: T,\n  level: StorageLevel,\n  tellMaster: Boolean = true): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putSingle...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putSingle is used when TorrentBroadcast is requested to write the blocks and readBroadcastBlock.
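As a sketch of what the delegation could look like (an assumption, consistent with the note under putIterator that putIterator is used when BlockManager is requested to putSingle; shown as it would appear inside BlockManager, so BlockId, StorageLevel and putIterator come from that context):

```scala
import scala.reflect.ClassTag

// Assumed body of putSingle: wrap the single value in a one-element iterator and
// reuse the putIterator path (memory/disk stores, replication, status reporting).
def putSingle[T: ClassTag](
    blockId: BlockId,
    value: T,
    level: StorageLevel,
    tellMaster: Boolean = true): Boolean = {
  putIterator(blockId, Iterator(value), level, tellMaster)
}
```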

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#doputiterator","title":"doPutIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPutIterator[T](\n  blockId: BlockId,\n  iterator: () => Iterator[T],\n  level: StorageLevel,\n  classTag: ClassTag[T],\n  tellMaster: Boolean = true,\n  keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPutIterator doPut with the putBody function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doPutIterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to getOrElseUpdate and putIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#putbody","title":"putBody
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBody: BlockInfo => Option[T]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For the given StorageLevel that indicates to use memory for storage, putBody requests the MemoryStore to putIteratorAsValues or putIteratorAsBytes based on the StorageLevel (that indicates to use deserialized format or not, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In case storing the block in memory was not possible (due to lack of available memory), putBody prints out the following WARN message to the logs and falls back on the DiskStore to store the block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Persisting block [blockId] to disk instead.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For the given StorageLevel that indicates to use disk storage only (useMemory flag is disabled), putBody requests the DiskStore to store the block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBody gets the current block status and checks whether the StorageLevel is valid (that indicates that the block was stored successfully).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If the block was stored successfully, putBody reports the block status (only if indicated by the the given tellMaster flag and the tellMaster flag of the associated BlockInfo) and addUpdatedBlockStatusToTaskMetrics.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBody prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Put block [blockId] locally took [duration] ms\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For the given StorageLevel with replication enabled (above 1), putBody doGetLocalBytes and replicates the block (to other BlockManagers). putBody prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Put block [blockId] remotely took [duration] ms\n
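The decision flow above can be summarized in a simplified, hypothetical sketch. The helpers storeInMemoryAsValues, storeInMemoryAsBytes and storeOnDisk are made up for illustration (they stand in for the MemoryStore/DiskStore calls named in the prose and are not the real signatures); timing and logging of the DEBUG messages are omitted.

```scala
// Hypothetical outline of the putBody function passed to doPut.
def putBodySketch(info: BlockInfo): Option[PartiallyUnrolledIterator[T]] = {
  if (level.useMemory) {
    // Deserialized level -> store values; serialized level -> store bytes.
    val storedInMemory =
      if (level.deserialized) storeInMemoryAsValues(blockId)   // MemoryStore.putIteratorAsValues
      else storeInMemoryAsBytes(blockId)                       // MemoryStore.putIteratorAsBytes
    if (!storedInMemory && level.useDisk) {
      logWarning(s"Persisting block $blockId to disk instead.")
      storeOnDisk(blockId)                                     // fall back on the DiskStore
    }
  } else if (level.useDisk) {
    storeOnDisk(blockId)                                       // disk-only storage level
  }

  // Report the block status (if requested) and record task metrics.
  val status = getCurrentBlockStatus(blockId, info)
  if (status.storageLevel.isValid) {
    if (tellMaster && info.tellMaster) reportBlockStatus(blockId, status)
    addUpdatedBlockStatusToTaskMetrics(blockId, status)
  }

  // Replicate to other BlockManagers when the level asks for more than one copy.
  if (level.replication > 1) {
    replicate(blockId, doGetLocalBytes(blockId, info), level, classTag)
  }
  None  // a Some(...) iterator would be returned when the block could not be stored
}
```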
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#dogetlocalbytes","title":"doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doGetLocalBytes(\n  blockId: BlockId,\n  info: BlockInfo): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doGetLocalBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doGetLocalBytes\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to getLocalBytes, doPutIterator and replicateBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#dropping-block-from-memory","title":"Dropping Block from Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory(\n  blockId: BlockId,\n  data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Dropping block [blockId] from memory\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory requests the BlockInfoManager to assert that the block is locked for writing (that gives a BlockInfo or throws a SparkException).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory drops to disk if the current storage level requires so (based on the given BlockInfo) and the block is not in the DiskStore already. dropFromMemory prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Writing block [blockId] to disk\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory uses the given data to determine whether the DiskStore is requested to put or putBytes (Array[T] or ChunkedByteBuffer, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory requests the MemoryStore to remove the block. dropFromMemory prints out the following WARN message to the logs if the block was not found in the MemoryStore:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Block [blockId] could not be dropped from memory as it does not exist\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory gets the current block status and reportBlockStatus when requested (when the tellMaster flag of the BlockInfo is turned on).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory addUpdatedBlockStatusToTaskMetrics when the block has been updated (dropped to disk or removed from the MemoryStore).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, dropFromMemory returns the current StorageLevel of the block (off the BlockStatus).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    dropFromMemory is part of the BlockEvictionHandler abstraction.
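A simplified, hypothetical sketch of the flow described above follows; the helper writeToDisk is illustrative only (it stands in for DiskStore.put with a serialization callback), while the other calls mirror the steps named in the prose.

```scala
// Hypothetical outline of dropFromMemory, as it would appear inside BlockManager.
def dropFromMemorySketch[T](
    blockId: BlockId,
    data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel = {
  logInfo(s"Dropping block $blockId from memory")
  val info = blockInfoManager.assertBlockIsLockedForWriting(blockId)

  // Spill to disk only if the storage level asks for disk and the block is not there yet.
  if (info.level.useDisk && !diskStore.contains(blockId)) {
    logInfo(s"Writing block $blockId to disk")
    data() match {
      case Left(elements) => writeToDisk(blockId, elements)      // DiskStore.put
      case Right(bytes)   => diskStore.putBytes(blockId, bytes)  // DiskStore.putBytes
    }
  }

  // Remove the in-memory copy (it may already be gone).
  if (!memoryStore.remove(blockId)) {
    logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
  }

  val status = getCurrentBlockStatus(blockId, info)
  if (info.tellMaster) reportBlockStatus(blockId, status)
  addUpdatedBlockStatusToTaskMetrics(blockId, status)

  status.storageLevel  // the block's current storage level after the drop
}
```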

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#releaselock-method","title":"releaseLock Method
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseLock(\n  blockId: BlockId,\n  taskAttemptId: Option[Long] = None): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseLock requests the BlockInfoManager to unlock the given block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseLock is part of the BlockDataManager abstraction.
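A minimal sketch of the delegation (assumed to live inside BlockManager, where blockInfoManager is its BlockInfoManager):

```scala
// Assumed body: unlock the block, optionally on behalf of a specific task attempt.
def releaseLock(blockId: BlockId, taskAttemptId: Option[Long] = None): Unit = {
  blockInfoManager.unlock(blockId, taskAttemptId)
}
```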

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#putblockdataasstream","title":"putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBlockDataAsStream(\n  blockId: BlockId,\n  level: StorageLevel,\n  classTag: ClassTag[_]): StreamCallbackWithID\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBlockDataAsStream is part of the BlockDataManager abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    putBlockDataAsStream...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#maximum-memory","title":"Maximum Memory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Total maximum value that BlockManager can ever possibly use (that depends on MemoryManager and may vary over time).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Total available on-heap and off-heap memory for storage (in bytes)
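A sketch of the relationship (an assumption, not stated in the text above): the maximum memory is the sum of the maximum on-heap and off-heap storage memory reported by the MemoryManager, which is why it may vary over time under unified memory management.

```scala
// Assumed wiring between the BlockManager's memory limits and the MemoryManager.
val maxOnHeapMemory  = memoryManager.maxOnHeapStorageMemory   // bytes
val maxOffHeapMemory = memoryManager.maxOffHeapStorageMemory  // bytes
val maxMemory        = maxOnHeapMemory + maxOffHeapMemory     // total storage memory (bytes)
```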

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#maximum-off-heap-memory","title":"Maximum Off-Heap Memory","text":""},{"location":"storage/BlockManager/#maximum-on-heap-memory","title":"Maximum On-Heap Memory","text":""},{"location":"storage/BlockManager/#decommissionself","title":"decommissionSelf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionSelf(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionSelf...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionSelf is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManagerStorageEndpoint is requested to handle a DecommissionBlockManager message
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#decommissionblockmanager","title":"decommissionBlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionBlockManager(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionBlockManager sends a DecommissionBlockManager message to the BlockManagerStorageEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissionBlockManager is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • CoarseGrainedExecutorBackend is requested to decommissionSelf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#blockmanagerstorageendpoint","title":"BlockManagerStorageEndpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    storageEndpoint: RpcEndpointRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManager sets up a RpcEndpointRef (within the RpcEnv) under the name BlockManagerEndpoint[ID] with a BlockManagerStorageEndpoint message handler.
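As a rough, self-contained illustration of the message flow between decommissionBlockManager and this storage endpoint, here is a toy sketch in plain Scala. The classes below stand in for the RPC machinery; the names are hypothetical and not Spark's actual internals.

```scala
// Toy sketch of the flow: decommissionBlockManager sends a DecommissionBlockManager
// message to the storage endpoint, whose handler reacts by asking the owning
// block manager to decommission itself.
sealed trait StorageMessage
case object DecommissionBlockManager extends StorageMessage

// Stand-in for the BlockManagerStorageEndpoint message handler.
class StorageEndpointSketch(onDecommission: () => Unit) {
  def receive(message: StorageMessage): Unit = message match {
    case DecommissionBlockManager => onDecommission()
  }
}

// Stand-in for the BlockManager side of the exchange.
class BlockManagerSketch {
  private val storageEndpoint = new StorageEndpointSketch(() => decommissionSelf())

  def decommissionBlockManager(): Unit =
    storageEndpoint.receive(DecommissionBlockManager)

  private def decommissionSelf(): Unit =
    println("decommissioning this block manager")
}

object MessageFlowDemo extends App {
  new BlockManagerSketch().decommissionBlockManager()
}
```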

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#blockmanagerdecommissioner","title":"BlockManagerDecommissioner
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissioner: Option[BlockManagerDecommissioner]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManager defines decommissioner internal registry for a BlockManagerDecommissioner.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissioner is undefined (None) by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManager creates and starts a BlockManagerDecommissioner when requested to decommissionSelf.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    decommissioner is used for isDecommissioning and lastMigrationInfo.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManager requests the BlockManagerDecommissioner to stop when stopped.
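A self-contained sketch of this lifecycle (stub classes and hypothetical names, not Spark's implementation) could look like this: the registry is None until decommissioning is requested, the decommissioner is created and started at most once, and it is stopped together with its owner.

```scala
// Stub standing in for BlockManagerDecommissioner.
class DecommissionerStub {
  def start(): Unit = println("decommissioner started")
  def stop(): Unit  = println("decommissioner stopped")
}

// Sketch of the Option-based registry pattern described above.
class BlockManagerLifecycleSketch {
  @volatile private var decommissioner: Option[DecommissionerStub] = None

  def decommissionSelf(): Unit = synchronized {
    if (decommissioner.isEmpty) {
      val d = new DecommissionerStub
      d.start()
      decommissioner = Some(d)
    }
  }

  def isDecommissioning: Boolean = decommissioner.isDefined

  def stop(): Unit = decommissioner.foreach(_.stop())
}
```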

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#removing-block-from-memory-and-disk","title":"Removing Block from Memory and Disk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeBlockInternal(\n  blockId: BlockId,\n  tellMaster: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For tellMaster turned on, removeBlockInternal requests the BlockInfoManager to assert that the block is locked for writing and remembers the current block status. Otherwise, removeBlockInternal leaves the block status undetermined.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeBlockInternal requests the MemoryStore to remove the block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeBlockInternal requests the DiskStore to remove the block.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeBlockInternal requests the BlockInfoManager to remove the block metadata.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, removeBlockInternal reports the block status (to the master) with the storage level changed to NONE.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    removeBlockInternal prints out the following WARN message when the block was not stored in the MemoryStore and the DiskStore:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Block [blockId] could not be removed as it was not found on disk or in memory\n
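The steps above can be summarized with a self-contained sketch. Mutable sets stand in for the MemoryStore and DiskStore, the metadata and master-reporting steps are only marked with comments, and all names are hypothetical.

```scala
import scala.collection.mutable

// Sketch of the removal flow: try both stores, warn if the block was in neither,
// then (in the real code) drop the block metadata and report the status to the
// master with the storage level changed to NONE when tellMaster is enabled.
class RemovalSketch(
    memoryStore: mutable.Set[String],
    diskStore: mutable.Set[String]) {

  def removeBlockInternal(blockId: String, tellMaster: Boolean): Unit = {
    val removedFromMemory = memoryStore.remove(blockId)
    val removedFromDisk   = diskStore.remove(blockId)

    if (!removedFromMemory && !removedFromDisk) {
      println(s"WARN Block $blockId could not be removed as it was not found on disk or in memory")
    }

    // The BlockInfoManager would be asked to remove the block metadata here.

    if (tellMaster) {
      // The block status would be reported to the master here.
    }
  }
}
```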

removeBlockInternal is used when:

• BlockManager is requested to put a new block and remove a block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#maybecachediskbytesinmemory","title":"maybeCacheDiskBytesInMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maybeCacheDiskBytesInMemory(\n  blockInfo: BlockInfo,\n  blockId: BlockId,\n  level: StorageLevel,\n  diskData: BlockData): Option[ChunkedByteBuffer]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maybeCacheDiskBytesInMemory...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    maybeCacheDiskBytesInMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager is requested to getLocalValues and doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.storage.BlockManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.storage.BlockManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/BlockManagerDecommissioner/","title":"BlockManagerDecommissioner","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManagerDecommissioner is a decommissioning process used by BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/BlockManagerDecommissioner/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    BlockManagerDecommissioner takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BlockManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerDecommissioner is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • BlockManager is requested to decommissionSelf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/BlockManagerId/","title":"BlockManagerId","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerId is a unique identifier (address) of a BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/BlockManagerInfo/","title":"BlockManagerInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      = BlockManagerInfo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerInfo is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/BlockManagerMaster/","title":"BlockManagerMaster","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerMaster runs on the driver and executors to exchange block metadata (status and locations) in a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerMaster uses BlockManagerMasterEndpoint (registered as BlockManagerMaster RPC endpoint on the driver with the endpoint references on executors) for executors to send block status updates and so let the driver keep track of block status and locations.
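As a rough, in-memory illustration of that contract (a toy stand-in, not Spark's RPC-based implementation; all names below are hypothetical): executors push block status updates to a single driver-side registry, so the driver can answer where a block lives without querying every executor.

```scala
import scala.collection.mutable

final case class BlockStatusSketch(sizeInBytes: Long)

// Toy stand-in for the driver-side bookkeeping of block statuses and locations.
class DriverSideBlockRegistry {
  // blockId -> (executorId -> status)
  private val statuses =
    mutable.Map.empty[String, mutable.Map[String, BlockStatusSketch]]

  def updateBlockInfo(executorId: String, blockId: String, status: BlockStatusSketch): Unit =
    statuses.getOrElseUpdate(blockId, mutable.Map.empty).update(executorId, status)

  def getLocations(blockId: String): Seq[String] =
    statuses.get(blockId).map(_.keys.toSeq).getOrElse(Seq.empty)
}

object BlockRegistryDemo extends App {
  val registry = new DriverSideBlockRegistry
  registry.updateBlockInfo("executor-1", "rdd_0_0", BlockStatusSketch(1024))
  registry.updateBlockInfo("executor-2", "rdd_0_0", BlockStatusSketch(1024))
  println(registry.getLocations("rdd_0_0")) // e.g. List(executor-1, executor-2)
}
```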

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/BlockManagerMaster/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      BlockManagerMaster takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Driver Endpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Heartbeat Endpoint
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • isDriver flag (whether it is created for the driver or executors)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMaster is created\u00a0when:

• SparkEnv utility is used to create a SparkEnv (and create a BlockManager)
"},{"location":"storage/BlockManagerMaster/#driver-endpoint","title":"Driver Endpoint

BlockManagerMaster is given a RpcEndpointRef of the BlockManagerMaster RPC Endpoint (on the driver) when created.

","text":""},{"location":"storage/BlockManagerMaster/#heartbeat-endpoint","title":"Heartbeat Endpoint

BlockManagerMaster is given a RpcEndpointRef of the BlockManagerMasterHeartbeat RPC Endpoint (on the driver) when created.

The endpoint is used (mainly) when:

• DAGScheduler is requested to executorHeartbeatReceived
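For illustration only, a minimal sketch of such a heartbeat check over this endpoint; blockManagerMaster, driverHeartbeatEndPoint and blockManagerId are assumptions of this sketch (internal, version-dependent names), not a documented public API.

import org.apache.spark.storage.BlockManagerMessages.BlockManagerHeartbeat

// Hypothetical sketch: ask the BlockManagerMasterHeartbeat endpoint (blocking)
// whether the driver still knows this BlockManagerId; a negative reply would
// tell the executor-side BlockManager to re-register.
val stillRegistered: Boolean = blockManagerMaster.driverHeartbeatEndPoint
  .askSync[Boolean](BlockManagerHeartbeat(blockManagerId))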
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMaster/#registering-blockmanager-on-executor-with-driver","title":"Registering BlockManager (on Executor) with Driver
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        registerBlockManager(\n  id: BlockManagerId,\n  localDirs: Array[String],\n  maxOnHeapMemSize: Long,\n  maxOffHeapMemSize: Long,\n  storageEndpoint: RpcEndpointRef): BlockManagerId\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        registerBlockManager prints out the following INFO message to the logs (with the given BlockManagerId):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Registering BlockManager [id]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        registerBlockManager notifies the driver (using the BlockManagerMaster RPC endpoint) that the BlockManagerId wants to register (and sends a blocking RegisterBlockManager message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Note

The inputs maxOnHeapMemSize and maxOffHeapMemSize are the total available on-heap and off-heap memory, respectively, for storage on the BlockManager.

registerBlockManager waits until a confirmation comes (as a possibly-updated BlockManagerId).

In the end, registerBlockManager prints out the following INFO message to the logs and returns the received (possibly updated) BlockManagerId.

Registered BlockManager [updatedId]\n

registerBlockManager is used when:

• BlockManager is requested to initialize and reregister
• FallbackStorage utility is used to registerBlockManagerIfNeeded
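The registration round trip described above can be sketched as follows. This is a simplified reconstruction, not the actual Spark source: driverEndpoint stands for the BlockManagerMaster RPC endpoint reference, and the RegisterBlockManager fields simply mirror the signature shown above (they may differ across Spark versions).

import org.apache.spark.internal.Logging
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.storage.BlockManagerId
import org.apache.spark.storage.BlockManagerMessages.RegisterBlockManager

// Simplified sketch of the blocking registration handshake with the driver.
class BlockManagerMasterSketch(driverEndpoint: RpcEndpointRef) extends Logging {

  def registerBlockManager(
      id: BlockManagerId,
      localDirs: Array[String],
      maxOnHeapMemSize: Long,
      maxOffHeapMemSize: Long,
      storageEndpoint: RpcEndpointRef): BlockManagerId = {
    logInfo(s"Registering BlockManager $id")
    // askSync blocks until the BlockManagerMasterEndpoint on the driver
    // confirms with a (possibly updated) BlockManagerId.
    val updatedId = driverEndpoint.askSync[BlockManagerId](
      RegisterBlockManager(id, localDirs, maxOnHeapMemSize, maxOffHeapMemSize, storageEndpoint))
    logInfo(s"Registered BlockManager $updatedId")
    updatedId
  }
}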
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMaster/#finding-block-locations-for-single-block","title":"Finding Block Locations for Single Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations(\n  blockId: BlockId): Seq[BlockManagerId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations requests the driver (using the BlockManagerMaster RPC endpoint) for BlockManagerIds of the given BlockId (and sends a blocking GetLocations message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to fetchRemoteManagedBuffer
• BlockManagerMaster is requested to check whether a block exists (contains)
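A minimal sketch of the single-block lookup described above (again a reconstruction of internal behavior; driverEndpoint is a placeholder for the BlockManagerMaster RPC endpoint reference):

import org.apache.spark.storage.{BlockId, BlockManagerId}
import org.apache.spark.storage.BlockManagerMessages.GetLocations

// Simplified sketch: one blocking ask that returns every BlockManager
// currently reported to hold the block.
def getLocations(blockId: BlockId): Seq[BlockManagerId] =
  driverEndpoint.askSync[Seq[BlockManagerId]](GetLocations(blockId))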
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMaster/#finding-block-locations-for-multiple-blocks","title":"Finding Block Locations for Multiple Blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations(\n  blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations requests the driver (using the BlockManagerMaster RPC endpoint) for BlockManagerIds of the given BlockIds (and sends a blocking GetLocationsMultipleBlockIds message).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getLocations\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DAGScheduler is requested for BlockManagers (executors) for cached RDD partitions
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getLocationBlockIds
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager utility is used to blockIdsToLocations
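For illustration, a hypothetical caller holding an Array[BlockId] (say, the RDD blocks of cached partitions) can resolve all locations in a single driver round trip; blockManagerMaster and the block IDs below are placeholders for this sketch.

import org.apache.spark.storage.{BlockId, BlockManagerId, RDDBlockId}

// Hypothetical usage sketch: the batch variant keeps it to one blocking
// round trip instead of one GetLocations ask per block.
val blockIds: Array[BlockId] = Array(RDDBlockId(rddId = 0, splitIndex = 0), RDDBlockId(0, 1))
val locations: IndexedSeq[Seq[BlockManagerId]] = blockManagerMaster.getLocations(blockIds)
blockIds.zip(locations).foreach { case (blockId, holders) =>
  println(s"$blockId -> ${holders.map(_.hostPort).mkString(", ")}")
}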
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMaster/#contains","title":"contains
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        contains(\n  blockId: BlockId): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        contains is positive (true) when there is at least one executor with the given BlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        contains\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • LocalRDDCheckpointData is requested to doCheckpoint
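Given the description above, contains can be sketched as a thin wrapper over the single-block getLocations (a reconstruction, not necessarily the exact implementation):

import org.apache.spark.storage.BlockId

// Sketch: the block "exists" when at least one BlockManager reports a location for it.
def contains(blockId: BlockId): Boolean =
  getLocations(blockId).nonEmpty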
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMaster/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.storage.BlockManagerMaster logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        log4j.logger.org.apache.spark.storage.BlockManagerMaster=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/BlockManagerMasterEndpoint/","title":"BlockManagerMasterEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMasterEndpoint is a rpc:RpcEndpoint.md#ThreadSafeRpcEndpoint[ThreadSafeRpcEndpoint] for storage:BlockManagerMaster.md[BlockManagerMaster].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMasterEndpoint is registered under BlockManagerMaster name.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMasterEndpoint tracks status of the storage:BlockManager.md[BlockManagers] (on the executors) in a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[creating-instance]] Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMasterEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[rpcEnv]] rpc:RpcEnv.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[isLocal]] Flag whether BlockManagerMasterEndpoint works in local or cluster mode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[conf]] SparkConf.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[listenerBus]] scheduler:LiveListenerBus.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerMasterEndpoint is created for the core:SparkEnv.md#create[SparkEnv] on the driver (to create a storage:BlockManagerMaster.md[] for a storage:BlockManager.md#master[BlockManager]).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When created, BlockManagerMasterEndpoint prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/BlockManagerMasterEndpoint/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"storage/BlockManagerMasterEndpoint/#blockmanagermasterendpoint-up","title":"BlockManagerMasterEndpoint up","text":"

== [[messages]][[receiveAndReply]] Messages

As an rpc:RpcEndpoint.md[], BlockManagerMasterEndpoint handles RPC messages.

=== [[BlockManagerHeartbeat]] BlockManagerHeartbeat

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        BlockManagerHeartbeat( blockManagerId: BlockManagerId)

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetLocations]] GetLocations

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_1","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GetLocations( blockId: BlockId)

When received, BlockManagerMasterEndpoint replies with the block locations of the given blockId.

Posted when BlockManagerMaster.md#getLocations-block[BlockManagerMaster requests the block locations of a single block].
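A minimal sketch of how such a message can be served in receiveAndReply (a reconstruction of the general pattern, not the verbatim Spark source; the locations map is a stand-in for the endpoint's internal registry):

import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}
import org.apache.spark.storage.{BlockId, BlockManagerId}
import org.apache.spark.storage.BlockManagerMessages.GetLocations

// Sketch of the receiveAndReply pattern for GetLocations.
class MasterEndpointSketch(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  private val locations = scala.collection.mutable.Map.empty[BlockId, Seq[BlockManagerId]]

  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case GetLocations(blockId) =>
      // The reply unblocks the askSync call on the BlockManagerMaster side.
      context.reply(locations.getOrElse(blockId, Seq.empty))
  }
}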

=== [[GetLocationsAndStatus]] GetLocationsAndStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/BlockManagerMasterEndpoint/#source-scala_2","title":"[source, scala]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GetLocationsAndStatus( blockId: BlockId)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        When received, BlockManagerMasterEndpoint...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Posted when...FIXME

=== [[GetLocationsMultipleBlockIds]] GetLocationsMultipleBlockIds

[source, scala]
----
GetLocationsMultipleBlockIds(
  blockIds: Array[BlockId])
----

When received, BlockManagerMasterEndpoint replies with the block locations of every given storage:BlockId.md[].

Posted when BlockManagerMaster.md#getLocations[BlockManagerMaster requests the block locations for multiple blocks].
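As a usage illustration (a minimal driver-side sketch, not part of the Spark sources), the following assumes an initialized SparkEnv and asks BlockManagerMaster for the locations of two hypothetical RDD blocks in one round trip, which is carried to BlockManagerMasterEndpoint as a GetLocationsMultipleBlockIds message.

[source, scala]
----
// Driver-side sketch (assumes an initialized SparkEnv; the RDD id and
// partition indices below are made up for illustration).
import org.apache.spark.SparkEnv
import org.apache.spark.storage.{BlockId, RDDBlockId}

val blockIds: Array[BlockId] = Array(RDDBlockId(0, 0), RDDBlockId(0, 1))

// One sequence of BlockManagerIds per requested block
// (empty when a block is not stored anywhere).
val locations = SparkEnv.get.blockManager.master.getLocations(blockIds)
----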

=== [[GetPeers]] GetPeers

[source, scala]
----
GetPeers(
  blockManagerId: BlockManagerId)
----

When received, BlockManagerMasterEndpoint replies with the peers of blockManagerId.

Peers of a storage:BlockManager.md[BlockManager] are the other BlockManagers in a cluster (except the driver's BlockManager). Peers are used to know the available executors in a Spark application.

Posted when BlockManagerMaster.md#getPeers[BlockManagerMaster requests the peers of a BlockManager].
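To make the notion of peers concrete, here is a small self-contained sketch (not the Spark implementation) that computes peers exactly as described above: every known BlockManagerId except the asking one and the driver's. The blockManagerInfo parameter is a hypothetical stand-in for the endpoint's registry of registered block managers.

[source, scala]
----
import org.apache.spark.storage.BlockManagerId

// Hypothetical helper: given the registry of known block managers, return the
// peers of blockManagerId, i.e. every other non-driver BlockManager.
def peersOf(
    blockManagerId: BlockManagerId,
    blockManagerInfo: Map[BlockManagerId, AnyRef]): Seq[BlockManagerId] = {
  blockManagerInfo.keySet
    .filterNot(_ == blockManagerId) // exclude the asking BlockManager itself
    .filterNot(_.isDriver)          // exclude the driver's BlockManager
    .toSeq
}
----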

=== [[GetExecutorEndpointRef]] GetExecutorEndpointRef

[source, scala]
----
GetExecutorEndpointRef(
  executorId: String)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetMemoryStatus]] GetMemoryStatus

[source, scala]
----
GetMemoryStatus
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetStorageStatus]] GetStorageStatus

[source, scala]
----
GetStorageStatus
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[GetBlockStatus]] GetBlockStatus

[source, scala]
----
GetBlockStatus(
  blockId: BlockId,
  askSlaves: Boolean = true)
----

When received, BlockManagerMasterEndpoint is requested to get the status of the given block.

Posted when...FIXME

=== [[GetMatchingBlockIds]] GetMatchingBlockIds

[source, scala]
----
GetMatchingBlockIds(
  filter: BlockId => Boolean,
  askSlaves: Boolean = true)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[HasCachedBlocks]] HasCachedBlocks

[source, scala]
----
HasCachedBlocks(
  executorId: String)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RegisterBlockManager]] RegisterBlockManager

[source, scala]
----
RegisterBlockManager(
  blockManagerId: BlockManagerId,
  maxOnHeapMemSize: Long,
  maxOffHeapMemSize: Long,
  sender: RpcEndpointRef)
----

When received, BlockManagerMasterEndpoint is requested to register a BlockManager (by the given storage:BlockManagerId.md[]).

Posted when BlockManagerMaster is requested to storage:BlockManagerMaster.md#registerBlockManager[register a BlockManager].
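For illustration only (RegisterBlockManager is internal to Spark, so this is conceptual rather than user-facing code), the sketch below shows what a registration request could carry for a hypothetical executor. All identifiers and sizes are made up; the sender RpcEndpointRef is what lets BlockManagerMasterEndpoint call the registering BlockManager back later.

[source, scala]
----
// Conceptual sketch only; the executor id, host, port and memory sizes are hypothetical.
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.storage.BlockManagerId
import org.apache.spark.storage.BlockManagerMessages.RegisterBlockManager // private[spark]

def registrationMessageFor(slaveEndpoint: RpcEndpointRef): RegisterBlockManager = {
  // executor id, host and port of the registering BlockManager (made up)
  val id = BlockManagerId("1", "10.0.0.5", 43211)
  RegisterBlockManager(
    id,
    maxOnHeapMemSize = 2L * 1024 * 1024 * 1024, // 2 GB of on-heap storage memory
    maxOffHeapMemSize = 0L,                     // no off-heap storage memory
    sender = slaveEndpoint)                     // the endpoint the master can call back
}
----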

=== [[RemoveRdd]] RemoveRdd

[source, scala]
----
RemoveRdd(
  rddId: Int)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveShuffle]] RemoveShuffle

[source, scala]
----
RemoveShuffle(
  shuffleId: Int)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveBroadcast]] RemoveBroadcast

[source, scala]
----
RemoveBroadcast(
  broadcastId: Long,
  removeFromDriver: Boolean = true)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveBlock]] RemoveBlock

[source, scala]
----
RemoveBlock(
  blockId: BlockId)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[RemoveExecutor]] RemoveExecutor

[source, scala]
----
RemoveExecutor(
  execId: String)
----

When received, BlockManagerMasterEndpoint removes the executor (execId) and sends true back in response.

Posted when BlockManagerMaster.md#removeExecutor[BlockManagerMaster removes an executor].
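As a usage illustration (a minimal sketch, assuming an initialized SparkEnv and a hypothetical executor id), removing an executor's block metadata on the driver goes through BlockManagerMaster, which delivers the request to BlockManagerMasterEndpoint as a RemoveExecutor message:

[source, scala]
----
// Driver-side sketch: drop all block metadata tracked for a lost executor.
// The executor id below is hypothetical.
import org.apache.spark.SparkEnv

val lostExecutorId = "3"
SparkEnv.get.blockManager.master.removeExecutor(lostExecutorId)
----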

=== [[StopBlockManagerMaster]] StopBlockManagerMaster

[source, scala]
----
StopBlockManagerMaster
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when...FIXME

=== [[UpdateBlockInfo]] UpdateBlockInfo

[source, scala]
----
UpdateBlockInfo(
  blockManagerId: BlockManagerId,
  blockId: BlockId,
  storageLevel: StorageLevel,
  memSize: Long,
  diskSize: Long)
----

When received, BlockManagerMasterEndpoint...FIXME

Posted when BlockManagerMaster is requested to storage:BlockManagerMaster.md#updateBlockInfo[handle a block status update (from BlockManager on an executor)].
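For illustration only, a block status update could be sent as below. The sample values are made up, and UpdateBlockInfo, BlockManagerId and the Boolean reply are Spark-internal details, so this is a hedged sketch rather than a supported API usage.

[source, scala]
----
// Hypothetical sketch: a BlockManager reporting a cached RDD block
// to the driver's BlockManagerMasterEndpoint.
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.storage.{BlockId, BlockManagerId, StorageLevel}
import org.apache.spark.storage.BlockManagerMessages.UpdateBlockInfo

def reportBlockSketch(driverEndpoint: RpcEndpointRef): Boolean = {
  val bmId = BlockManagerId("exec-1", "10.0.0.7", 43021)  // sample executor
  val update = new UpdateBlockInfo(
    bmId,
    BlockId("rdd_0_0"),        // partition 0 of RDD 0
    StorageLevel.MEMORY_ONLY,  // kept deserialized in memory only
    1024L * 1024,              // memSize: 1 MB in memory
    0L)                        // diskSize: nothing on disk
  // The endpoint acknowledges whether the block status was recorded
  driverEndpoint.askSync[Boolean](update)
}
----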

== [[storageStatus]] storageStatus Internal Method

[source, scala]
----
storageStatus: Array[StorageStatus]
----

storageStatus...FIXME

storageStatus is used when BlockManagerMasterEndpoint is requested to handle <> message.

== [[getLocationsMultipleBlockIds]] getLocationsMultipleBlockIds Internal Method

[source, scala]
----
getLocationsMultipleBlockIds(
  blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]]
----

getLocationsMultipleBlockIds...FIXME

getLocationsMultipleBlockIds is used when BlockManagerMasterEndpoint is requested to handle <> message.
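Pending the FIXME above, the signature suggests a per-block lookup. The sketch below assumes (without confirmation from this page) that the method simply applies the <<getLocations, getLocations>> lookup described below to every block ID:

[source, scala]
----
// Assumed shape: resolve many blocks by reusing the single-block lookup.
import org.apache.spark.storage.{BlockId, BlockManagerId}

def getLocationsMultipleBlockIdsSketch(
    blockIds: Array[BlockId],
    getLocations: BlockId => Seq[BlockManagerId]): IndexedSeq[Seq[BlockManagerId]] = {
  // One Seq[BlockManagerId] per requested block, in the same order
  blockIds.toIndexedSeq.map(getLocations)
}
----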

== [[removeShuffle]] removeShuffle Internal Method

[source, scala]
----
removeShuffle(
  shuffleId: Int): Future[Seq[Boolean]]
----

removeShuffle...FIXME

removeShuffle is used when BlockManagerMasterEndpoint is requested to handle <> message.
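Pending the FIXME above, the Future[Seq[Boolean]] result suggests that every registered BlockManager is asked to drop the shuffle's blocks and the per-BlockManager replies are gathered into one Future. The sketch below is built on that assumption; the RemoveShuffle message defined inline is a stand-in, not the actual BlockManagerMessages type:

[source, scala]
----
// Assumed shape: fan the request out to every BlockManager endpoint
// and collect the Boolean acknowledgements.
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.rpc.RpcEndpointRef

def removeShuffleSketch(
    shuffleId: Int,
    blockManagerEndpoints: Seq[RpcEndpointRef])(
    implicit ec: ExecutionContext): Future[Seq[Boolean]] = {
  case class RemoveShuffle(shuffleId: Int)  // stand-in message type
  Future.sequence(
    blockManagerEndpoints.map(_.ask[Boolean](RemoveShuffle(shuffleId))))
}
----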

== [[getPeers]] getPeers Internal Method

[source, scala]
----
getPeers(
  blockManagerId: BlockManagerId): Seq[BlockManagerId]
----

getPeers finds all the registered BlockManagers (using the <> internal registry) and checks whether the input blockManagerId is among them.

If the input blockManagerId is registered, getPeers returns all the registered BlockManagers except the driver's BlockManager and blockManagerId itself.

Otherwise, getPeers returns no BlockManagers.

NOTE: Peers of a storage:BlockManager.md[BlockManager] are the other BlockManagers in a cluster (except the driver's BlockManager). Peers are used to know the available executors in a Spark application.

getPeers is used when BlockManagerMasterEndpoint is requested to handle <> message.
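The filtering can be sketched as follows. The Map-based registry is an assumption standing in for the real blockManagerInfo registry; only BlockManagerId.isDriver and the set operations are meant literally:

[source, scala]
----
// Sketch of getPeers: every registered, non-driver BlockManager
// except the one asking.
import org.apache.spark.storage.BlockManagerId

def getPeersSketch(
    blockManagerInfo: Map[BlockManagerId, AnyRef],  // assumed registry shape
    blockManagerId: BlockManagerId): Seq[BlockManagerId] = {
  val registered = blockManagerInfo.keySet
  if (registered.contains(blockManagerId)) {
    // Keep peers that are neither the driver's BlockManager nor the caller
    registered.filterNot(id => id.isDriver || id == blockManagerId).toSeq
  } else {
    Seq.empty
  }
}
----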

== [[register]] register Internal Method

[source, scala]
----
register(
  idWithoutTopologyInfo: BlockManagerId,
  maxOnHeapMemSize: Long,
  maxOffHeapMemSize: Long,
  slaveEndpoint: RpcEndpointRef): BlockManagerId
----

register registers a storage:BlockManager.md[] (based on the given storage:BlockManagerId.md[]) in the <> and <> registries and posts a SparkListenerBlockManagerAdded message (to the <>).

NOTE: The input maxOnHeapMemSize and maxOffHeapMemSize together make up the storage:BlockManager.md#maxMemory[total available on-heap and off-heap memory for storage on a BlockManager].

NOTE: Registering a BlockManager can only happen once per executor (identified by BlockManagerId.executorId in the <> internal registry).

If another BlockManager has earlier been registered for the executor, you should see the following ERROR message in the logs:

[source,plaintext]
----
Got two different block manager registrations on same executor - will replace old one [oldId] with new one [id]
----

And then <>.

register prints out the following INFO message to the logs:

[source,plaintext]
----
Registering block manager [hostPort] with [bytes] RAM, [id]
----

The BlockManager is recorded in the internal registries:

• <>
• <>

In the end, register requests the <> to scheduler:LiveListenerBus.md#post[post] a SparkListener.md#SparkListenerBlockManagerAdded[SparkListenerBlockManagerAdded] message.

register is used when BlockManagerMasterEndpoint is requested to handle <> message.
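The registration flow can be sketched as below. The registry type, the duplicate handling, and the combined maxMem passed to SparkListenerBlockManagerAdded are simplifying assumptions; only the event and listener-bus types are real (and Spark-internal):

[source, scala]
----
// Simplified sketch of register: record the BlockManager once per
// executor and announce it on the listener bus.
import scala.collection.mutable
import org.apache.spark.rpc.RpcEndpointRef
import org.apache.spark.scheduler.{LiveListenerBus, SparkListenerBlockManagerAdded}
import org.apache.spark.storage.BlockManagerId

def registerSketch(
    id: BlockManagerId,
    maxOnHeapMemSize: Long,
    maxOffHeapMemSize: Long,
    slaveEndpoint: RpcEndpointRef,
    blockManagerIdByExecutor: mutable.Map[String, BlockManagerId],
    listenerBus: LiveListenerBus): BlockManagerId = {
  // Only one BlockManager per executor; a duplicate replaces the old one
  blockManagerIdByExecutor.get(id.executorId).foreach { oldId =>
    // Here the ERROR message above would be logged and oldId removed
    blockManagerIdByExecutor -= oldId.executorId
  }
  blockManagerIdByExecutor(id.executorId) = id
  listenerBus.post(
    SparkListenerBlockManagerAdded(
      System.currentTimeMillis(), id, maxOnHeapMemSize + maxOffHeapMemSize,
      Some(maxOnHeapMemSize), Some(maxOffHeapMemSize)))
  id
}
----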

== [[removeExecutor]] removeExecutor Internal Method

[source, scala]
----
removeExecutor(
  execId: String): Unit
----

removeExecutor prints the following INFO message to the logs:

[source,plaintext]
----
Trying to remove executor [execId] from BlockManagerMaster.
----

If the execId executor is registered (in the <> internal registry), removeExecutor <<removeBlockManager, removes the corresponding BlockManager>>.

removeExecutor is used when BlockManagerMasterEndpoint is requested to handle <> or <> messages.
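A minimal sketch of that flow, assuming blockManagerIdByExecutor is the executor-to-BlockManagerId registry and removeBlockManager is passed in as a function (both names are illustrative):

[source, scala]
----
// Sketch of removeExecutor: look up the executor's BlockManager and
// delegate its removal to removeBlockManager.
import scala.collection.mutable
import org.apache.spark.internal.Logging
import org.apache.spark.storage.BlockManagerId

class RemoveExecutorSketch(
    blockManagerIdByExecutor: mutable.Map[String, BlockManagerId],
    removeBlockManager: BlockManagerId => Unit) extends Logging {

  def removeExecutor(execId: String): Unit = {
    logInfo(s"Trying to remove executor $execId from BlockManagerMaster.")
    // Only registered executors have a BlockManager to remove
    blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
  }
}
----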

== [[removeBlockManager]] removeBlockManager Internal Method

[source, scala]
----
removeBlockManager(
  blockManagerId: BlockManagerId): Unit
----

removeBlockManager looks up blockManagerId and removes the executor it was working on from the internal registries:

• <>
• <>

It then goes over all the blocks of the BlockManager and removes the BlockManager from each block's locations in the blockLocations registry.

SparkListener.md#SparkListenerBlockManagerRemoved[SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId)] is posted to the SparkContext.md#listenerBus[listenerBus].

You should then see the following INFO message in the logs:

[source,plaintext]
----
Removing block manager [blockManagerId]
----

removeBlockManager is used when BlockManagerMasterEndpoint is requested to <> (to handle <> or <> messages).
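Assuming blockLocations maps each BlockId to the set of BlockManagerIds hosting it (and the other registry shapes below are similar guesses), the cleanup can be sketched as:

[source, scala]
----
// Sketch of removeBlockManager: drop the BlockManager from the
// registries, un-track its blocks, and announce the removal.
import scala.collection.mutable
import org.apache.spark.scheduler.{LiveListenerBus, SparkListenerBlockManagerRemoved}
import org.apache.spark.storage.{BlockId, BlockManagerId}

def removeBlockManagerSketch(
    blockManagerId: BlockManagerId,
    blockManagerIdByExecutor: mutable.Map[String, BlockManagerId],
    blocksOf: BlockManagerId => Iterable[BlockId],  // assumed accessor
    blockLocations: mutable.Map[BlockId, mutable.Set[BlockManagerId]],
    listenerBus: LiveListenerBus): Unit = {
  blockManagerIdByExecutor -= blockManagerId.executorId
  // Forget this BlockManager as a location of every block it hosted
  blocksOf(blockManagerId).foreach { blockId =>
    blockLocations.get(blockId).foreach { locations =>
      locations -= blockManagerId
      if (locations.isEmpty) blockLocations -= blockId
    }
  }
  listenerBus.post(
    SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId))
}
----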

== [[getLocations]] getLocations Internal Method

[source, scala]
----
getLocations(
  blockId: BlockId): Seq[BlockManagerId]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getLocations looks up the given storage:BlockId.md[] in the blockLocations internal registry and returns the locations (as a collection of BlockManagerId) or an empty collection.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getLocations is used when BlockManagerMasterEndpoint is requested to handle <> and <> messages.
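A minimal standalone sketch of the lookup (with toy BlockId and BlockManagerId stand-ins, not Spark's classes) shows why an unknown block simply yields an empty collection:

[source,scala]
----
import scala.collection.mutable

object GetLocationsSketch extends App {
  // Toy stand-ins for the storage types (assumptions, not Spark's classes).
  final case class BlockId(name: String)
  final case class BlockManagerId(executorId: String, host: String, port: Int)

  // blockLocations-style registry: a block and the block managers that hold it.
  val blockLocations = mutable.HashMap.empty[BlockId, mutable.HashSet[BlockManagerId]]

  // getLocations-style lookup: the known locations, or an empty collection otherwise.
  def getLocations(blockId: BlockId): Seq[BlockManagerId] =
    blockLocations.get(blockId).map(_.toSeq).getOrElse(Seq.empty)

  blockLocations(BlockId("rdd_1_0")) =
    mutable.HashSet(BlockManagerId("1", "worker-1", 38793))

  println(getLocations(BlockId("rdd_1_0"))) // one known location
  println(getLocations(BlockId("rdd_9_9"))) // List() -- unknown block
}
----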

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[logging]] Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enable ALL logging level for org.apache.spark.storage.BlockManagerMasterEndpoint logger to see what happens inside.

Add the following line to conf/log4j.properties:

[source]
----
log4j.logger.org.apache.spark.storage.BlockManagerMasterEndpoint=ALL
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[internal-properties]] Internal Properties

=== [[blockManagerIdByExecutor]] blockManagerIdByExecutor Lookup Table

[source,scala]
----
blockManagerIdByExecutor: Map[String, BlockManagerId]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Lookup table of storage:BlockManagerId.md[]s by executor ID

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A new executor is added when BlockManagerMasterEndpoint is requested to handle a <> message (and <>).

An executor is removed when BlockManagerMasterEndpoint is requested to handle <> and <> messages (via <>).

Used when BlockManagerMasterEndpoint is requested to handle a <> message, <>, <> and <>.
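A minimal sketch of such a registry, assuming toy types and hypothetical register/removeExecutor helpers (not Spark's actual method names):

[source,scala]
----
import scala.collection.mutable

object BlockManagerIdByExecutorSketch {
  // Toy BlockManagerId (assumption, not Spark's class).
  final case class BlockManagerId(executorId: String, host: String, port: Int)

  // executor ID -> BlockManagerId, filled on registration, emptied on executor removal.
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Handling a "register block manager" message: remember the executor's block manager.
  def register(id: BlockManagerId): Unit =
    blockManagerIdByExecutor(id.executorId) = id

  // Handling a "remove executor" message: look the executor up and forget it.
  def removeExecutor(executorId: String): Option[BlockManagerId] =
    blockManagerIdByExecutor.remove(executorId)
}
----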

=== [[blockManagerInfo]] blockManagerInfo Lookup Table

[source,scala]
----
blockManagerInfo: Map[BlockManagerId, BlockManagerInfo]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Lookup table of storage:BlockManagerInfo.md[] by storage:BlockManagerId.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A new BlockManagerInfo is added when BlockManagerMasterEndpoint is requested to handle a <> message (and <>).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A BlockManagerInfo is removed when BlockManagerMasterEndpoint is requested to <> (to handle <> and <> messages).
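A similar sketch for this registry, again with toy stand-in types and hypothetical register/removeBlockManager helpers rather than Spark's actual code:

[source,scala]
----
import scala.collection.mutable

object BlockManagerInfoSketch {
  // Toy stand-ins (assumptions): a block manager's identity and what the master tracks about it.
  final case class BlockManagerId(executorId: String, host: String, port: Int)
  final case class BlockManagerInfo(id: BlockManagerId, maxMem: Long, lastSeenMs: Long)

  // BlockManagerId -> BlockManagerInfo: added on registration, dropped on removal.
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, BlockManagerInfo]

  def register(id: BlockManagerId, maxMem: Long): Unit =
    blockManagerInfo(id) = BlockManagerInfo(id, maxMem, System.currentTimeMillis())

  def removeBlockManager(id: BlockManagerId): Option[BlockManagerInfo] =
    blockManagerInfo.remove(id)
}
----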

=== [[blockLocations]] blockLocations

[source,scala]
----
blockLocations: Map[BlockId, Set[BlockManagerId]]
----

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Collection of storage:BlockId.md[] and their locations (as BlockManagerId).

Used in removeRdd to remove blocks for an RDD, removeBlockManager to remove blocks after a BlockManager gets removed, removeBlockFromWorkers, updateBlockInfo, and <>.
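To illustrate how the registry is pruned when a block manager goes away, here is a standalone sketch with toy types and a hypothetical removeFromLocations helper (an assumption for illustration, not Spark's method):

[source,scala]
----
import scala.collection.mutable

object BlockLocationsSketch {
  // Toy stand-ins (assumptions, not Spark's classes).
  final case class BlockId(name: String)
  final case class BlockManagerId(executorId: String, host: String, port: Int)

  // BlockId -> the block managers currently holding that block.
  val blockLocations = mutable.HashMap.empty[BlockId, mutable.HashSet[BlockManagerId]]

  // When a block manager goes away, drop it from every block's location set
  // and forget blocks that are left with no locations at all.
  def removeFromLocations(removed: BlockManagerId): Unit =
    blockLocations.keys.toSeq.foreach { blockId =>
      val locations = blockLocations(blockId)
      locations -= removed
      if (locations.isEmpty) blockLocations.remove(blockId)
    }
}
----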

= BlockManagerMasterHeartbeatEndpoint

BlockManagerMasterHeartbeatEndpoint is...FIXME

= BlockManagerSlaveEndpoint

BlockManagerSlaveEndpoint is a ThreadSafeRpcEndpoint for BlockManager.

== [[creating-instance]] Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockManagerSlaveEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • [[rpcEnv]] rpc:RpcEnv.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • [[blockManager]] Parent BlockManager.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • [[mapOutputTracker]] scheduler:MapOutputTracker.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockManagerSlaveEndpoint is created for BlockManager.md#slaveEndpoint[BlockManager] (and registered under the name BlockManagerEndpoint[ID]).
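The name-based registration can be pictured with a small sketch built on toy RPC types (RpcEnvLike, SlaveEndpointLike and the id used for the endpoint name are assumptions for illustration, not Spark's API):

[source,scala]
----
import scala.collection.mutable

object SlaveEndpointRegistrationSketch extends App {
  // Toy stand-ins (assumptions, not Spark's RPC classes).
  trait RpcEndpoint

  final class RpcEnvLike {
    private val endpoints = mutable.HashMap.empty[String, RpcEndpoint]
    def setupEndpoint(name: String, endpoint: RpcEndpoint): Unit = endpoints(name) = endpoint
    def registeredNames: Set[String] = endpoints.keySet.toSet
  }

  final class SlaveEndpointLike extends RpcEndpoint

  val rpcEnv = new RpcEnvLike
  val id = 1 // hypothetical ID used to build the endpoint name
  rpcEnv.setupEndpoint(s"BlockManagerEndpoint$id", new SlaveEndpointLike)
  println(rpcEnv.registeredNames) // Set(BlockManagerEndpoint1)
}
----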

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            == [[messages]] Messages

=== [[GetBlockStatus]] GetBlockStatus

[source, scala]
----
GetBlockStatus(
  blockId: BlockId,
  askSlaves: Boolean = true)
----

When received, BlockManagerSlaveEndpoint requests the <> for the BlockManager.md#getStatus[status of a given block] (by BlockId.md[]) and sends it back to the sender.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when...FIXME

=== [[GetMatchingBlockIds]] GetMatchingBlockIds

[source, scala]
----
GetMatchingBlockIds(
  filter: BlockId => Boolean,
  askSlaves: Boolean = true)
----

When received, BlockManagerSlaveEndpoint requests the <> to storage:BlockManager.md#getMatchingBlockIds[find IDs of existing blocks for a given filter] and sends them back to the sender.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Posted when...FIXME
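A standalone sketch of the filter-based lookup the message models (the block ID hierarchy below is a simplified assumption, not Spark's full set of BlockId subclasses):

[source,scala]
----
object GetMatchingBlockIdsSketch extends App {
  // Toy block ID hierarchy (assumption; Spark's BlockId family is richer).
  sealed trait BlockId
  final case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
  final case class BroadcastBlockId(broadcastId: Long) extends BlockId

  val existingBlocks: Seq[BlockId] =
    Seq(RDDBlockId(1, 0), RDDBlockId(2, 3), BroadcastBlockId(7))

  // The kind of filter a caller could carry in a GetMatchingBlockIds-style message:
  // here, "all blocks of RDD 1".
  val filter: BlockId => Boolean = {
    case RDDBlockId(rddId, _) => rddId == 1
    case _                    => false
  }

  println(existingBlocks.filter(filter)) // List(RDDBlockId(1,0))
}
----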

=== [[RemoveBlock]] RemoveBlock

[source, scala]
----
RemoveBlock(
  blockId: BlockId)
----

When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

[source,plaintext]
----
removing block [blockId]
----

BlockManagerSlaveEndpoint then removes the blockId block.

When the computation is successful, you should see the following DEBUG message in the logs:

[source,plaintext]
----
Done removing block [blockId], response is [response]
----

And a true response is sent back. You should then see the following DEBUG message in the logs:

[source,plaintext]
----
Sent response: true to [senderAddress]
----

In case of failure, you should see the following ERROR message in the logs together with the stack trace:

[source,plaintext]
----
Error in removing block [blockId]
----
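The remove-then-answer behaviour described above can be sketched as an asynchronous computation whose completion (or failure) is reported back to the sender. Everything below (reply, sendFailure, handleRemoveBlock and the toy BlockId) is a stand-in for illustration, not Spark's RPC API:

[source,scala]
----
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.{Failure, Success}

object RemoveBlockAsyncSketch extends App {
  // Toy stand-ins (assumptions): reply/sendFailure model answering the message sender.
  final case class BlockId(name: String)
  def reply(response: Any): Unit = println(s"Sent response: $response to [senderAddress]")
  def sendFailure(error: Throwable): Unit = println(s"Error in removing block: $error")

  // The actual removal from the local memory/disk stores would happen here.
  def removeBlock(blockId: BlockId): Boolean = true

  def handleRemoveBlock(blockId: BlockId): Unit = {
    println(s"removing block $blockId")
    Future(removeBlock(blockId)).onComplete {
      case Success(response) =>
        println(s"Done removing block $blockId, response is $response")
        reply(response) // true goes back to the sender
      case Failure(error) =>
        sendFailure(error) // the sender sees the failure (and its stack trace)
    }
  }

  handleRemoveBlock(BlockId("rdd_1_0"))
  Thread.sleep(500) // crude wait so the async callback can run before the demo exits
}
----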

=== [[RemoveBroadcast]] RemoveBroadcast

[source, scala]
----
RemoveBroadcast(
  broadcastId: Long,
  removeFromDriver: Boolean = true)
----

When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

[source,plaintext]
----
removing broadcast [broadcastId]
----

BlockManagerSlaveEndpoint then removes the broadcastId broadcast.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            When the computation is successful, you should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Done removing broadcast [broadcastId], response is [response]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            And the result is sent back. You should see the following DEBUG in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Sent response: [response] to [senderAddress]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In case of failure, you should see the following ERROR in the logs and the stack trace.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Error in removing broadcast [broadcastId]\n
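
For context, broadcast removal is normally triggered on the driver. A minimal usage sketch (assuming an existing SparkContext available as `sc`, e.g. in spark-shell) that eventually leads to RemoveBroadcast messages being sent to the executors:

[source, scala]
----
// Assumes an existing SparkContext available as `sc` (e.g. in spark-shell).
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Drop the cached copies on the executors; the value can be re-broadcast on demand.
lookup.unpersist(blocking = false)

// Drop every copy, including the driver's; the broadcast must not be used afterwards.
lookup.destroy()
----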

=== [[RemoveRdd]] RemoveRdd

[source, scala]
----
RemoveRdd(
  rddId: Int)
----

When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

removing RDD [rddId]

It then requests the BlockManager to remove the `rddId` RDD.

NOTE: Handling RemoveRdd messages happens on a separate thread. See <<asyncThreadPool, block-manager-slave-async-thread-pool Thread Pool>>.

When the computation is successful, you should see the following DEBUG message in the logs:

Done removing RDD [rddId], response is [response]

And the number of blocks removed is sent back. You should see the following DEBUG message in the logs:

Sent response: [#blocks] to [senderAddress]

In case of failure, you should see the following ERROR message in the logs together with the stack trace:

Error in removing RDD [rddId]
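
As with broadcasts, removing RDD blocks is typically initiated on the driver. A minimal sketch (again assuming an existing SparkContext `sc`) that eventually results in RemoveRdd messages:

[source, scala]
----
// Assumes an existing SparkContext available as `sc` (e.g. in spark-shell).
val numbers = sc.parallelize(1 to 1000).cache()
numbers.count()  // materializes and caches the RDD blocks on the executors

// Drop the cached blocks; the block managers are asked to remove them asynchronously.
numbers.unpersist(blocking = false)
----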

=== [[RemoveShuffle]] RemoveShuffle

[source, scala]
----
RemoveShuffle(
  shuffleId: Int)
----

When received, BlockManagerSlaveEndpoint prints out the following DEBUG message to the logs:

removing shuffle [shuffleId]

If a scheduler:MapOutputTracker.md[MapOutputTracker] was given (when the RPC endpoint was created), BlockManagerSlaveEndpoint requests it to scheduler:MapOutputTracker.md#unregisterShuffle[unregister the shuffleId shuffle].

It then requests the shuffle:ShuffleManager.md#unregisterShuffle[ShuffleManager to unregister the shuffleId shuffle].

NOTE: Handling RemoveShuffle messages happens on a separate thread. See <<asyncThreadPool, block-manager-slave-async-thread-pool Thread Pool>>.

When the computation is successful, you should see the following DEBUG message in the logs:

Done removing shuffle [shuffleId], response is [response]

And the result is sent back. You should see the following DEBUG message in the logs:

Sent response: [response] to [senderAddress]

In case of failure, you should see the following ERROR message in the logs together with the stack trace:

Error in removing shuffle [shuffleId]

RemoveShuffle is posted when BlockManagerMaster.md#removeShuffle[BlockManagerMaster] and storage:BlockManagerMasterEndpoint.md#removeShuffle[BlockManagerMasterEndpoint] are requested to remove all blocks of a shuffle.

=== [[ReplicateBlock]] ReplicateBlock

[source, scala]
----
ReplicateBlock(
  blockId: BlockId,
  replicas: Seq[BlockManagerId],
  maxReplicas: Int)
----

When received, BlockManagerSlaveEndpoint...FIXME

Posted when...FIXME

=== [[TriggerThreadDump]] TriggerThreadDump

When received, BlockManagerSlaveEndpoint replies with the thread info of all live threads (with stack traces and synchronization information).
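
Spark builds the thread dump with its own utilities, but as a rough sketch of what is involved, the JVM's ThreadMXBean can report all live threads together with their stack traces and lock information:

[source, scala]
----
import java.lang.management.ManagementFactory

object ThreadDumpSketch extends App {
  // Ask the JVM for every live thread, including stack traces,
  // locked monitors and locked ownable synchronizers.
  val infos = ManagementFactory.getThreadMXBean.dumpAllThreads(
    /* lockedMonitors = */ true,
    /* lockedSynchronizers = */ true)

  infos.foreach { info =>
    println(s"${info.getThreadName} (${info.getThreadState})")
    info.getStackTrace.take(3).foreach(frame => println(s"  at $frame"))
  }
}
----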

== [[asyncThreadPool]][[asyncExecutionContext]] block-manager-slave-async-thread-pool Thread Pool

BlockManagerSlaveEndpoint creates a thread pool of at most 100 daemon threads with the block-manager-slave-async-thread-pool thread name prefix (using a {java-javadoc-url}/java/util/concurrent/ThreadPoolExecutor.html[java.util.concurrent.ThreadPoolExecutor]).

BlockManagerSlaveEndpoint uses the thread pool (as a Scala implicit value) when requested to <<doAsync, doAsync>> to communicate in a non-blocking, asynchronous way.

The thread pool is shut down when BlockManagerSlaveEndpoint is requested to stop.

The reason for the async thread pool is that block-related operations may take quite some time. To release the main RPC thread, the work is handed over to the thread pool, whose threads talk to the other services and pass the responses back to the senders.
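
A minimal sketch of such a pool, assuming plain java.util.concurrent and Scala's ExecutionContext rather than Spark's own thread utilities (a fixed-size pool stands in for Spark's bounded cached pool):

[source, scala]
----
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext

object AsyncPoolSketch {
  // A thread factory that mimics the "block-manager-slave-async-thread-pool-N"
  // naming scheme and marks every thread as a daemon thread.
  private val counter = new AtomicInteger(0)
  private val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"block-manager-slave-async-thread-pool-${counter.getAndIncrement()}")
      t.setDaemon(true)
      t
    }
  }

  // At most 100 threads; exposed implicitly so Future { ... } blocks pick it up.
  implicit val asyncExecutionContext: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100, factory))
}
----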

== [[doAsync]] doAsync Internal Method

[source, scala]
----
doAsync[T](
  actionMessage: String,
  context: RpcCallContext)(
  body: => T): Unit
----

doAsync creates a Scala Future to execute the following asynchronously (i.e. on a separate thread from the <<asyncThreadPool, block-manager-slave-async-thread-pool>>):

. Prints out the given actionMessage as a DEBUG message to the logs

. Executes the given body

When completed successfully, doAsync prints out the following DEBUG messages to the logs and requests the given RpcCallContext to reply with the response to the sender:

[source,plaintext]
----
Done [actionMessage], response is [response]
Sent response: [response] to [senderAddress]
----

In case of a failure, doAsync prints out the following ERROR message to the logs and requests the given RpcCallContext to send the failure to the sender:

[source,plaintext]
----
Error in [actionMessage]
----

doAsync is used when BlockManagerSlaveEndpoint is requested to handle <<RemoveBlock, RemoveBlock>>, <<RemoveRdd, RemoveRdd>>, <<RemoveShuffle, RemoveShuffle>> and <<RemoveBroadcast, RemoveBroadcast>> messages.
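
A simplified sketch of the doAsync pattern described above; the `reply` and `sendFailure` callbacks are hypothetical stand-ins for the RpcCallContext methods, and println stands in for the logging:

[source, scala]
----
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object DoAsyncSketch {
  // A doAsync-like helper: run `body` off the caller's thread and report
  // the outcome through the given callbacks (stand-ins for RpcCallContext).
  def doAsync[T](actionMessage: String)(
      reply: T => Unit,
      sendFailure: Throwable => Unit)(
      body: => T)(implicit ec: ExecutionContext): Unit = {
    println(s"DEBUG $actionMessage")
    Future(body).onComplete {
      case Success(response) =>
        println(s"DEBUG Done $actionMessage, response is $response")
        reply(response)
      case Failure(t) =>
        println(s"ERROR Error in $actionMessage")
        sendFailure(t)
    }
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    doAsync[Int]("removing RDD 0")(n => println(s"removed $n blocks"), _.printStackTrace()) {
      42  // pretend to remove blocks and return how many were removed
    }
    Thread.sleep(500)  // give the future a chance to complete before the JVM exits
  }
}
----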

== [[logging]] Logging

Enable ALL logging level for org.apache.spark.storage.BlockManagerSlaveEndpoint logger to see what happens inside.

Add the following line to conf/log4j.properties:

[source]
----
log4j.logger.org.apache.spark.storage.BlockManagerSlaveEndpoint=ALL
----

Refer to spark-logging.md[Logging].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerSource/","title":"BlockManagerSource -- Metrics Source for BlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockManagerSource is the spark-metrics-Source.md[metrics source] of a storage:BlockManager.md[BlockManager].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            [[sourceName]] BlockManagerSource is registered under the name BlockManager (when SparkContext is created).
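
The gauges listed in the table below are registered with Dropwizard Metrics (the library behind Spark's metrics system). A minimal sketch of registering one such gauge; the `totalDiskUsedBytes` helper is a hypothetical stand-in for summing diskUsed over the storage statuses:

[source, scala]
----
import com.codahale.metrics.{Gauge, MetricRegistry}

object BlockManagerSourceSketch {
  val metricRegistry = new MetricRegistry

  // Hypothetical stand-in for asking BlockManagerMaster for the storage status
  // of every BlockManager and summing up the disk space used (in bytes).
  def totalDiskUsedBytes(): Long = 0L

  // Register a gauge that converts the summed bytes to megabytes on every read.
  metricRegistry.register(
    MetricRegistry.name("disk", "diskSpaceUsed_MB"),
    new Gauge[Long] {
      override def getValue: Long = totalDiskUsedBytes() / 1024 / 1024
    })
}
----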

[[metrics]]
.BlockManagerSource's Gauge Metrics (in alphabetical order)
[width="100%",cols="1,1,2",options="header"]
|===
| Name | Type | Description

| disk.diskSpaceUsed_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their disk space used (diskUsed).

| memory.maxMem_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum memory limit (maxMem).

| memory.maxOffHeapMem_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum off-heap memory (maxOffHeapMem).

| memory.maxOnHeapMem_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their maximum on-heap memory (maxOnHeapMem).

| memory.memUsed_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their memory used (memUsed).

| memory.offHeapMemUsed_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their off-heap memory used (offHeapMemUsed).

| memory.onHeapMemUsed_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their on-heap memory used (onHeapMemUsed).

| memory.remainingMem_MB
| long
| Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their memory remaining (memRemaining).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | memory.remainingOffHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their off-heap memory remaining (offHeapMemRemaining).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | memory.remainingOnHeapMem_MB | long | Requests BlockManagerMaster for BlockManagerMaster.md#getStorageStatus[storage status] (for every storage:BlockManager.md[BlockManager]) and sums up their on-heap memory remaining (onHeapMemRemaining). |===
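Every gauge above follows the same pattern: request the storage status of every BlockManager from BlockManagerMaster and sum a single field across them. Below is a minimal, self-contained Scala sketch of that pattern with Dropwizard Metrics; the names StorageStatusSketch, StorageGaugesSketch and the getStorageStatus callback are simplified stand-ins, not Spark's internal classes.

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Simplified stand-in for Spark's StorageStatus with only the fields used here.
final case class StorageStatusSketch(diskUsed: Long, memUsed: Long, maxMem: Long)

// Registers gauges that, on every read, pull the latest storage statuses
// and sum one field across all BlockManagers, reporting the total in MB.
class StorageGaugesSketch(getStorageStatus: () => Seq[StorageStatusSketch]) {
  val metricRegistry = new MetricRegistry

  private def registerGauge(name: String, value: StorageStatusSketch => Long): Unit = {
    metricRegistry.register(name, new Gauge[Long] {
      override def getValue: Long = getStorageStatus().map(value).sum / 1024 / 1024
    })
  }

  registerGauge("disk.diskSpaceUsed_MB", _.diskUsed)
  registerGauge("memory.memUsed_MB", _.memUsed)
  registerGauge("memory.maxMem_MB", _.maxMem)
}
```

Because the value is computed inside getValue, such gauges are recomputed every time the metrics system polls them rather than being cached.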

You can access the BlockManagerSource metrics using the web UI's port (the spark-webui-properties.md#spark.ui.port[spark.ui.port] configuration property).

$ http --follow http://localhost:4040/metrics/json \
    | jq '.gauges | keys | .[] | select(test(".driver.BlockManager"; "g"))'
"local-1528725411625.driver.BlockManager.disk.diskSpaceUsed_MB"
"local-1528725411625.driver.BlockManager.memory.maxMem_MB"
"local-1528725411625.driver.BlockManager.memory.maxOffHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.maxOnHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.memUsed_MB"
"local-1528725411625.driver.BlockManager.memory.offHeapMemUsed_MB"
"local-1528725411625.driver.BlockManager.memory.onHeapMemUsed_MB"
"local-1528725411625.driver.BlockManager.memory.remainingMem_MB"
"local-1528725411625.driver.BlockManager.memory.remainingOffHeapMem_MB"
"local-1528725411625.driver.BlockManager.memory.remainingOnHeapMem_MB"

[[creating-instance]] [[blockManager]] BlockManagerSource takes a storage:BlockManager.md[BlockManager] when created.

BlockManagerSource is created when SparkContext is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerStorageEndpoint/","title":"BlockManagerStorageEndpoint","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockManagerStorageEndpoint is an IsolatedRpcEndpoint.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/BlockManagerStorageEndpoint/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            BlockManagerStorageEndpoint takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • RpcEnv
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MapOutputTracker

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockManagerStorageEndpoint is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockManagerStorageEndpoint/#messages","title":"Messages","text":""},{"location":"storage/BlockManagerStorageEndpoint/#decommissionblockmanager","title":"DecommissionBlockManager

When received, receiveAndReply requests the BlockManager to decommissionSelf.

DecommissionBlockManager is sent out when BlockManager is requested to decommissionBlockManager.
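A minimal, self-contained Scala sketch of that message flow, with hypothetical simplified types standing in for the Spark internals (the RpcCallContext is reduced to a plain reply callback):

```scala
// Illustrative only; not Spark's exact code.
case object DecommissionBlockManager

trait BlockManagerSketch {
  def decommissionSelf(): Unit
}

class StorageEndpointSketch(blockManager: BlockManagerSketch) {
  // Mirrors the receiveAndReply pattern: match on the incoming message,
  // delegate the work to the BlockManager, and reply with the result.
  def receiveAndReply(message: Any, reply: Any => Unit): Unit = message match {
    case DecommissionBlockManager =>
      reply(blockManager.decommissionSelf())
  }
}
```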

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockReplicationPolicy/","title":"BlockReplicationPolicy","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockReplicationPolicy is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockStoreClient/","title":"BlockStoreClient","text":"

BlockStoreClient is an abstraction of block clients that can fetch blocks from a remote node (an executor or an external service).

BlockStoreClient is a Java Closeable.

Note

BlockStoreClient was previously known as ShuffleClient (SPARK-28593).

"},{"location":"storage/BlockStoreClient/#contract","title":"Contract","text":""},{"location":"storage/BlockStoreClient/#fetching-blocks","title":"Fetching Blocks
void fetchBlocks(
  String host,
  int port,
  String execId,
  String[] blockIds,
  BlockFetchingListener listener,
  DownloadFileManager downloadFileManager)

Fetches blocks from a remote node (using the given DownloadFileManager); a hypothetical caller is sketched right after the list below.

Used when:

• BlockTransferService is requested to fetchBlockSync
• ShuffleBlockFetcherIterator is requested to sendRequest
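A sketch of a caller, assuming simplified stand-in types (BlockClientSketch, BlockFetchListenerSketch) rather than Spark's actual interfaces: the fetch is asynchronous, so the caller passes the block ids together with a listener that is notified per block as results (or failures) arrive.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for the per-block callback interface.
trait BlockFetchListenerSketch {
  def onBlockFetchSuccess(blockId: String, data: ByteBuffer): Unit
  def onBlockFetchFailure(blockId: String, cause: Throwable): Unit
}

// Hypothetical stand-in for a block client with a fetchBlocks-like method.
trait BlockClientSketch {
  def fetchBlocks(host: String, port: Int, execId: String,
                  blockIds: Array[String], listener: BlockFetchListenerSketch): Unit
}

object FetchBlocksExample {
  // Requests one shuffle block and reacts as the result (or a failure) comes back.
  def fetch(client: BlockClientSketch): Unit =
    client.fetchBlocks("executor-host", 7337, "exec-1", Array("shuffle_0_1_2"),
      new BlockFetchListenerSketch {
        def onBlockFetchSuccess(blockId: String, data: ByteBuffer): Unit =
          println(s"fetched $blockId (${data.remaining()} bytes)")
        def onBlockFetchFailure(blockId: String, cause: Throwable): Unit =
          println(s"failed to fetch $blockId: ${cause.getMessage}")
      })
}
```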
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockStoreClient/#shuffle-metrics","title":"Shuffle Metrics
MetricSet shuffleMetrics()

The shuffle metrics (as a Dropwizard MetricSet)

Default: (empty) (see the illustration after the list below)

Used when:

• BlockManager is requested for the Shuffle Metrics Source
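For illustration only (not Spark's code), the empty default can be expressed with Dropwizard Metrics as a MetricSet whose metric map is empty:

```scala
import java.util.Collections
import com.codahale.metrics.{Metric, MetricSet}

object EmptyShuffleMetricsSketch {
  // A MetricSet that exposes no metrics, matching the "(empty)" default above.
  val emptyShuffleMetrics: MetricSet = new MetricSet {
    override def getMetrics: java.util.Map[String, Metric] =
      Collections.emptyMap[String, Metric]()
  }
}
```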
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockStoreClient/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockTransferService
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ExternalBlockStoreClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockStoreUpdater/","title":"BlockStoreUpdater","text":"

BlockStoreUpdater is an abstraction of block store updaters that store blocks (from bytes, whether they are already in memory or in a file on disk).

BlockStoreUpdater is an internal class of BlockManager.
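The contract (described next) boils down to this: a concrete updater supplies the block bytes and a way to materialize them, and the shared save logic decides whether the block goes to the memory store or the disk store. A self-contained Scala sketch under those assumptions follows; the trait, its members, and the save flow are simplified illustrations, not Spark's exact code.

```scala
import java.nio.ByteBuffer

// Illustrative only: a simplified BlockStoreUpdater-like contract.
trait BlockStoreUpdaterSketch {
  def blockId: String
  def useMemory: Boolean              // stand-in for the block's StorageLevel
  def blockData(): Array[Byte]        // the raw block bytes to store
  def readToByteBuffer(): ByteBuffer  // block bytes as a buffer (for the memory store)
  def saveToDiskStore(): Unit         // write the block to the disk store

  // Simplified save flow: materialize the bytes for the memory store when the
  // storage level allows it, otherwise write them to the disk store.
  def save(): Unit = {
    if (useMemory) {
      val buffer = readToByteBuffer()
      // a real implementation would hand `buffer` to the memory store here
      println(s"stored $blockId in memory (${buffer.remaining()} bytes)")
    } else {
      saveToDiskStore()
      println(s"stored $blockId on disk")
    }
  }
}
```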

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockStoreUpdater/#contract","title":"Contract","text":""},{"location":"storage/BlockStoreUpdater/#block-data","title":"Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              blockData(): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BlockData

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TempFileBasedBlockStoreUpdater is requested to readToByteBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockStoreUpdater/#readtobytebuffer","title":"readToByteBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              readToByteBuffer(): ChunkedByteBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              saveToDiskStore(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • BlockStoreUpdater is requested to save
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/BlockStoreUpdater/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ByteBufferBlockStoreUpdater
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • TempFileBasedBlockStoreUpdater
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"storage/BlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

BlockStoreUpdater takes the following to be created:

• Block Size
• BlockId
• StorageLevel
• Scala's ClassTag
• tellMaster flag
• keepReadLock flag

Abstract Class

BlockStoreUpdater is an abstract class and cannot be created directly. It is created indirectly for the concrete BlockStoreUpdaters.
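For illustration only, the shape of this abstraction can be sketched as follows (a hypothetical SketchBlockStoreUpdater carrying the six constructor values listed above; not Spark's actual code):

import scala.reflect.ClassTag
import org.apache.spark.storage.{BlockId, StorageLevel}

// Hypothetical sketch only: the constructor carries the same six values listed
// above, and concrete subclasses differ in where the block bytes come from.
abstract class SketchBlockStoreUpdater[T](
    val blockSize: Long,
    val blockId: BlockId,
    val level: StorageLevel,
    val classTag: ClassTag[T],
    val tellMaster: Boolean,
    val keepReadLock: Boolean) {

  // A concrete updater decides how to expose the block data
  // (e.g. an in-memory buffer or a temporary file).
  def blockData(): Array[Byte]
}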

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockStoreUpdater/#saving-block-to-block-store","title":"Saving Block to Block Store
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                save(): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                save doPut with the putBody function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                save\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to putBlockDataAsStream and store block bytes locally
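A minimal sketch of that delegation, assuming a simplified doPut-style helper (the real doPut lives in BlockManager and also handles block locks, metrics and error cleanup; names here are illustrative only):

object SaveSketch {
  // Hypothetical doPut-style helper: runs the given body for the block.
  def doPut(blockId: String, keepReadLock: Boolean)(putBody: => Boolean): Boolean =
    putBody

  // save is essentially doPut invoked with the putBody function.
  def save(blockId: String): Boolean =
    doPut(blockId, keepReadLock = false) {
      // putBody: store the block in memory and/or on disk (see putBody Function)
      true
    }
}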
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockStoreUpdater/#putbody-function","title":"putBody Function

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With the StorageLevel with replication (above 1), the putBody function triggers replication concurrently (using a Future (Scala) on a separate thread from the ExecutionContextExecutorService).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In general, putBody stores the block in the MemoryStore first (if requested based on useMemory of the StorageLevel). putBody saves to a DiskStore (if useMemory is not specified or storing to the MemoryStore failed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                putBody stores the block in the MemoryStore only even if the useMemory and useDisk flags could both be turned on (true).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Spark drops the block to disk later if the memory store can't hold it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With the useMemory of the StorageLevel set, putBody saveDeserializedValuesToMemoryStore for deserialized storage level or saveSerializedValuesToMemoryStore otherwise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                putBody saves to a DiskStore when either of the following happens:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. Storing in memory fails and the useDisk (of the StorageLevel) is set
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. useMemory of the StorageLevel is not set yet the useDisk is
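The memory-first, disk-fallback decision can be sketched as follows (saveToMemory and saveToDisk are hypothetical stand-ins for the MemoryStore and DiskStore writes):

object PutBodyDecisionSketch {
  // Hypothetical sketch of the memory-first / disk-fallback decision.
  def storeBlock(
      useMemory: Boolean,
      useDisk: Boolean,
      saveToMemory: () => Boolean,
      saveToDisk: () => Boolean): Boolean = {
    val storedInMemory = useMemory && saveToMemory()
    if (storedInMemory) {
      true
    } else if (useDisk) {
      // Case 1 (memory attempt failed) or case 2 (memory not requested at all)
      saveToDisk()
    } else {
      false
    }
  }
}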

putBody getCurrentBlockStatus and checks whether the block is in either the memory or the disk store.

In the end, putBody reportBlockStatus (if the given tellMaster flag and the tellMaster flag of the BlockInfo are both enabled) and addUpdatedBlockStatusToTaskMetrics.

putBody prints out the following DEBUG message to the logs:

Put block [blockId] locally took [timeUsed] ms\n

putBody prints out the following WARN message to the logs when an attempt to store a block in memory fails and the useDisk is set:

Persisting block [blockId] to disk instead.\n
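As noted above, with a replicated StorageLevel (replication above 1) putBody triggers replication concurrently on a separate thread. A minimal sketch of that pattern with a plain Scala Future on a dedicated thread pool (illustrative only, not Spark's actual code):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object ReplicationSketch {
  // Illustrative thread pool so the local put does not wait for remote peers.
  private val replicationPool =
    ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

  // Fire replication asynchronously only when the storage level asks for it.
  def maybeReplicate(replication: Int)(replicate: => Unit): Option[Future[Unit]] =
    if (replication > 1) Some(Future(replicate)(replicationPool)) else None
}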
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockStoreUpdater/#saving-deserialized-values-to-memorystore","title":"Saving Deserialized Values to MemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveDeserializedValuesToMemoryStore(\n  inputStream: InputStream): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveDeserializedValuesToMemoryStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveDeserializedValuesToMemoryStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockStoreUpdater is requested to save a block (with memory deserialized storage level)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockStoreUpdater/#saving-serialized-values-to-memorystore","title":"Saving Serialized Values to MemoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveSerializedValuesToMemoryStore(\n  bytes: ChunkedByteBuffer): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveSerializedValuesToMemoryStore...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                saveSerializedValuesToMemoryStore\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockStoreUpdater is requested to save a block (with memory serialized storage level)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockStoreUpdater/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockStoreUpdater is an abstract class and logging is configured using the logger of the implementations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockTransferService/","title":"BlockTransferService","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockTransferService is an extension of the BlockStoreClient abstraction for shuffle clients that can fetch and upload blocks of data (synchronously or asynchronously).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockTransferService is a network service available by a host name and a port.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BlockTransferService was introduced in SPARK-3019 Pluggable block transfer interface (BlockTransferService).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockTransferService/#contract","title":"Contract","text":""},{"location":"storage/BlockTransferService/#host-name","title":"Host Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                hostName: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Host name this service is listening on

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockTransferService/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                init(\n  blockDataManager: BlockDataManager): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockTransferService/#port","title":"Port
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                port: Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockTransferService/#uploading-block-asynchronously","title":"Uploading Block Asynchronously
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlock(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Future[Unit]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockTransferService is requested to uploadBlockSync
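Condensing the signatures above, the contract can be sketched as an abstract class (a sketch only; the real BlockTransferService also extends BlockStoreClient and has more members):

import scala.concurrent.Future
import scala.reflect.ClassTag
import org.apache.spark.network.BlockDataManager
import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.storage.{BlockId, StorageLevel}

// Condensed sketch of the contract based on the signatures above.
abstract class BlockTransferServiceSketch {
  def hostName: String
  def port: Int
  def init(blockDataManager: BlockDataManager): Unit
  def uploadBlock(
      hostname: String,
      port: Int,
      execId: String,
      blockId: BlockId,
      blockData: ManagedBuffer,
      level: StorageLevel,
      classTag: ClassTag[_]): Future[Unit]
}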
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/BlockTransferService/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • NettyBlockTransferService
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/BlockTransferService/#uploading-block-synchronously","title":"Uploading Block Synchronously
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlockSync(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlockSync uploadBlock and waits till it finishes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlockSync\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockManager is requested to replicate
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ShuffleMigrationRunnable is requested to run
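A minimal sketch of the blocking wrapper (Await.result stands in for whatever awaiting utility the real implementation uses; the upload parameter is a hypothetical stand-in for uploadBlock):

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object UploadBlockSyncSketch {
  // The synchronous variant simply waits for the asynchronous upload to finish.
  def uploadBlockSync(upload: () => Future[Unit]): Unit =
    Await.result(upload(), Duration.Inf)
}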
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/","title":"ByteBufferBlockStoreUpdater","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ByteBufferBlockStoreUpdater is a BlockStoreUpdater (that BlockManager uses for storing a block from bytes already in memory).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/ByteBufferBlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ByteBufferBlockStoreUpdater takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • ChunkedByteBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • tellMaster flag (default: true)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • keepReadLock flag (default: false)

ByteBufferBlockStoreUpdater is created when:

* BlockManager is requested to store a block (bytes) locally

## Block Data

```scala
blockData(): BlockData
```

blockData creates a ByteBufferBlockData (with the ChunkedByteBuffer).

blockData is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/#readtobytebuffer","title":"readToByteBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readToByteBuffer(): ChunkedByteBuffer\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readToByteBuffer simply gives the ChunkedByteBuffer (it was created with).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  readToByteBuffer\u00a0is part of the BlockStoreUpdater abstraction.
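Both blockData (above) and readToByteBuffer are thin adapters over the ChunkedByteBuffer the updater was created with. A simplified, self-contained sketch of the idea (Chunked, ByteBufferData and InMemoryUpdater are stand-ins for the Spark-internal types, not the real classes):

```scala
import java.nio.ByteBuffer

// Stand-ins for ChunkedByteBuffer and ByteBufferBlockData.
final case class Chunked(chunks: Seq[ByteBuffer])
final case class ByteBufferData(buffer: Chunked)

// An updater handed bytes already in memory: both accessors are trivial.
class InMemoryUpdater(bytes: Chunked) {
  def blockData(): ByteBufferData = ByteBufferData(bytes) // wrap the buffer as block data
  def readToByteBuffer(): Chunked = bytes                 // nothing to read: already in memory
}
```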

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/ByteBufferBlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  saveToDiskStore(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  saveToDiskStore requests the DiskStore (of the parent BlockManager) to putBytes (with the BlockId and the ChunkedByteBuffer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  saveToDiskStore\u00a0is part of the BlockStoreUpdater abstraction.
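In other words, the whole method is a single delegation. A self-contained sketch of that shape (DiskStoreLike and DiskWritingUpdater are illustrative stand-ins for the Spark-internal DiskStore and updater):

```scala
// Stand-in for Spark's internal DiskStore.
trait DiskStoreLike {
  def putBytes(blockId: String, bytes: Array[Byte]): Unit
}

// The updater hands its BlockId and in-memory bytes over to the disk store.
class DiskWritingUpdater(blockId: String, bytes: Array[Byte], diskStore: DiskStoreLike) {
  def saveToDiskStore(): Unit = diskStore.putBytes(blockId, bytes)
}
```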

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/DiskBlockManager/","title":"DiskBlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DiskBlockManager manages a logical mapping of logical blocks and their physical on-disk locations for a BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  By default, one block is mapped to one file with a name given by BlockId. It is however possible to have a block to be mapped to a segment of a file only.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Block files are hashed among the local directories.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DiskBlockManager is used to create a DiskStore.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/DiskBlockManager/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DiskBlockManager takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • deleteFilesOnStop flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, DiskBlockManager creates the local directories for block storage and initializes the internal subDirs collection of locks for every local directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager createLocalDirsForMergedShuffleBlocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, DiskBlockManager registers a shutdown hook to clean up the local directories for blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager is created for BlockManager.
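A condensed sketch of that initialization order (all names and bodies here are simplified stand-ins; the real members are private to Spark's DiskBlockManager):

```scala
import java.io.File

class DiskBlockManagerSketch(rootDirs: Seq[String], deleteFilesOnStop: Boolean) {
  // 1. A blockmgr-* directory under every configured root directory.
  private val localDirs: Array[File] = rootDirs.map(new File(_)).toArray
  // 2. One lock/lookup slot per (local directory, subdirectory) pair
  //    (64 is assumed here as the default of spark.diskStore.subDirectories).
  private val subDirs: Array[Array[File]] = Array.fill(localDirs.length)(new Array[File](64))
  // 3. Prepare directories for merged shuffle blocks (noop unless push-based shuffle is on).
  createLocalDirsForMergedShuffleBlocks()
  // 4. Register a shutdown hook that cleans up the local directories on JVM exit.
  sys.addShutdownHook(if (deleteFilesOnStop) localDirs.foreach(_.delete()))

  private def createLocalDirsForMergedShuffleBlocks(): Unit = () // placeholder
}
```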

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/DiskBlockManager/#createlocaldirsformergedshuffleblocks","title":"createLocalDirsForMergedShuffleBlocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createLocalDirsForMergedShuffleBlocks(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createLocalDirsForMergedShuffleBlocks is a noop with isPushBasedShuffleEnabled disabled (YARN mode only).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createLocalDirsForMergedShuffleBlocks...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#accessing-diskblockmanager","title":"Accessing DiskBlockManager","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager is available using SparkEnv.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    org.apache.spark.SparkEnv.get.blockManager.diskBlockManager\n
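Building on that snippet, a hedged example that lists the blocks with files currently on local disk. BlockManager and DiskBlockManager are internal (private[spark]) classes, so this is assumed to live in an org.apache.spark.* package of your own code (the demo package and object names are illustrative):

```scala
package org.apache.spark.demo

import org.apache.spark.SparkEnv

object OnDiskBlocks {
  // Prints the BlockIds that currently have files on local disk.
  def list(): Unit = {
    val diskBlockManager = SparkEnv.get.blockManager.diskBlockManager
    diskBlockManager.getAllBlocks().foreach(blockId => println(blockId.name))
  }
}
```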
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/DiskBlockManager/#local-directories-for-block-storage","title":"Local Directories for Block Storage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager creates blockmgr directory in every local root directory when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager uses localDirs internal registry of all the blockmgr directories.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockManager expects at least one local directory or prints out the following ERROR message to the logs and exits the JVM (with exit code 53):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Failed to create any local dir.\n
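A minimal sketch of that guard (println stands in for Spark's logError; 53 is the exit code quoted above):

```scala
import java.io.File

// Refuse to continue without at least one usable local directory.
def requireAtLeastOneLocalDir(localDirs: Array[File]): Unit = {
  if (localDirs.isEmpty) {
    println("ERROR Failed to create any local dir.")
    System.exit(53)
  }
}
```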

localDirs is used when:

* DiskBlockManager is created (and creates localDirsString and subDirs), requested to look up a file (among local subdirectories) and doStop
* BlockManager is requested to register with an external shuffle server
* BasePythonRunner (PySpark) is requested to compute

## localDirsString

DiskBlockManager uses localDirsString internal registry of the paths of the local blockmgr directories.

localDirsString is used by BlockManager when requested for getLocalDiskDirs.

## Creating blockmgr Directory in Every Local Root Directory

```scala
createLocalDirs(
  conf: SparkConf): Array[File]
```

createLocalDirs creates blockmgr local directories for storing block data.

createLocalDirs creates a blockmgr-[randomUUID] directory under every root directory for local storage and prints out the following INFO message to the logs:

```text
Created local directory at [localDir]
```

In case of an exception, createLocalDirs prints out the following ERROR message to the logs and ignores the directory:

```text
Failed to create local dir in [rootDir]. Ignoring this directory.
```
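A self-contained sketch of that loop (the root directories are taken from a plain sequence here rather than SparkConf, and println stands in for Spark's logging):

```scala
import java.io.File
import java.util.UUID

// Create a blockmgr-[randomUUID] directory under every root directory,
// log a message per directory, and ignore any root directory that fails.
def createLocalDirs(rootDirs: Seq[String]): Array[File] =
  rootDirs.flatMap { rootDir =>
    try {
      val localDir = new File(rootDir, s"blockmgr-${UUID.randomUUID}")
      if (!localDir.mkdirs()) throw new java.io.IOException(s"Failed to create $localDir")
      println(s"INFO Created local directory at $localDir")
      Some(localDir)
    } catch {
      case _: Exception =>
        println(s"ERROR Failed to create local dir in $rootDir. Ignoring this directory.")
        None
    }
  }.toArray
```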
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#file-locks-for-local-block-store-directories","title":"File Locks for Local Block Store Directories
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    subDirs: Array[Array[File]]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    subDirs is a lookup table for file locks of every local block directory (with the first dimension for local directories and the second for locks).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The number of block subdirectories is controlled by spark.diskStore.subDirectories configuration property.

subDirs(dirId)(subDirId) gives access to the subDirId-th subdirectory of the dirId-th local directory.
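The layout can be pictured with a minimal stand-alone sketch (the directories, the lookup helper and its names are made up for illustration; this is not Spark's code):

// Minimal sketch of the subDirs layout (made-up directories; not Spark's code):
// first dimension = local directories, second = lazily-created subdirectories.
import java.io.File

val localDirs: Array[File] =
  Array(new File("/tmp/blockmgr-0"), new File("/tmp/blockmgr-1"))
val subDirsPerLocalDir = 64  // spark.diskStore.subDirectories (default: 64)

val subDirs: Array[Array[File]] =
  Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

// A slot stays null until the corresponding subdirectory is first needed.
def lookup(dirId: Int, subDirId: Int): Option[File] =
  Option(subDirs(dirId)(subDirId))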

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    subDirs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DiskBlockManager is requested for a block file and all the block files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#finding-block-file-and-creating-parent-directories","title":"Finding Block File (and Creating Parent Directories)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getFile(\n  blockId: BlockId): File\ngetFile(\n  filename: String): File\n

getFile computes a hash of the file name of the input BlockId and uses it to choose the parent (local) directory and the subdirectory of the block file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getFile creates the subdirectory unless it already exists.
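The directory selection can be sketched as follows. This is a simplified, stand-alone version with a hypothetical pickBlockFile helper, not Spark's exact code (the real getFile also synchronizes on the subDirs slot and reuses an already-created subdirectory):

// Simplified sketch of picking the directory and subdirectory for a block file.
import java.io.File

def pickBlockFile(
    filename: String,
    localDirs: Array[File],
    subDirsPerLocalDir: Int): File = {
  val hash = filename.hashCode & Int.MaxValue                    // non-negative hash of the file name
  val dirId = hash % localDirs.length                            // which local directory
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir  // which subdirectory
  val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
  if (!subDir.exists()) subDir.mkdirs()                          // create the subdirectory unless it exists
  new File(subDir, filename)
}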

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DiskBlockManager is requested to containsBlock, createTempLocalBlock, createTempShuffleBlock

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DiskStore is requested to getBytes, remove, contains, and put

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • IndexShuffleBlockResolver is requested to getDataFile and getIndexFile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#createtempshuffleblock","title":"createTempShuffleBlock
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createTempShuffleBlock(): (TempShuffleBlockId, File)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createTempShuffleBlock creates a temporary TempShuffleBlockId block.

createTempShuffleBlock keeps generating TempShuffleBlockIds (with a random UUID) until it finds one with no block file yet, and returns the block ID together with its block file.
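A minimal sketch of that retry-until-unique pattern (the simplified TempShuffleBlockId and the hypothetical blockFile helper stand in for Spark's own types and getFile):

// Sketch only: draw random IDs until the corresponding file does not exist yet.
import java.io.File
import java.util.UUID

final case class TempShuffleBlockId(id: UUID) {
  def name: String = "temp_shuffle_" + id
}

def createTempShuffleBlock(blockFile: String => File): (TempShuffleBlockId, File) = {
  var blockId = TempShuffleBlockId(UUID.randomUUID())
  while (blockFile(blockId.name).exists()) {
    blockId = TempShuffleBlockId(UUID.randomUUID())
  }
  (blockId, blockFile(blockId.name))
}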

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#registering-shutdown-hook","title":"Registering Shutdown Hook
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    addShutdownHook(): AnyRef\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    addShutdownHook registers a shutdown hook to execute doStop at shutdown.

While registering the hook, addShutdownHook prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Adding shutdown hook\n

The shutdown hook itself, once triggered at shutdown, prints out the following INFO message and executes doStop:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Shutdown hook called\n
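A plain-JVM sketch of the pattern (Spark registers the hook through its own ShutdownHookManager utility rather than through Runtime directly):

// Plain-JVM sketch: register a hook that logs and stops the disk block manager.
def addShutdownHook(doStop: () => Unit): Thread = {
  println("Adding shutdown hook")          // DEBUG in the real logs
  val hook = new Thread(() => {
    println("Shutdown hook called")        // INFO in the real logs
    doStop()
  })
  Runtime.getRuntime.addShutdownHook(hook)
  hook
}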
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#getting-writable-directories-in-yarn","title":"Getting Writable Directories in YARN
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getYarnLocalDirs(\n  conf: SparkConf): String\n

getYarnLocalDirs uses the given SparkConf to read the LOCAL_DIRS environment variable with comma-separated local directories (that have already been created and secured so that only the user has access to them).

getYarnLocalDirs throws an Exception when the LOCAL_DIRS environment variable is not set:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Yarn Local dirs can't be empty\n
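That behaviour boils down to a short sketch (the environment lookup is abstracted away as a getenv parameter; not Spark's exact code):

// Sketch of the behaviour described above.
def getYarnLocalDirs(getenv: String => String): String = {
  val localDirs = Option(getenv("LOCAL_DIRS")).getOrElse("")
  if (localDirs.isEmpty) {
    throw new Exception("Yarn Local dirs can't be empty")
  }
  localDirs
}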
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#checking-whether-spark-runs-on-yarn","title":"Checking Whether Spark Runs on YARN
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    isRunningInYarnContainer(\n  conf: SparkConf): Boolean\n

isRunningInYarnContainer uses the given SparkConf to read Hadoop YARN's CONTAINER_ID environment variable (exported by a YARN NodeManager) to find out whether Spark runs in a YARN container.
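In a sketch (again with the environment lookup abstracted away as a getenv parameter):

// Sketch: Spark runs in a YARN container when the NodeManager exported CONTAINER_ID.
def isRunningInYarnContainer(getenv: String => String): Boolean =
  Option(getenv("CONTAINER_ID")).exists(_.nonEmpty)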

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#getting-all-blocks-from-files-stored-on-disk","title":"Getting All Blocks (From Files Stored On Disk)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAllBlocks(): Seq[BlockId]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAllBlocks gets all the blocks stored on disk.

Internally, getAllBlocks lists all the block files and converts their names back to BlockIds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAllBlocks is used when BlockManager is requested to find IDs of existing blocks for a given filter.
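The mapping from files to block IDs can be sketched as follows (getAllFiles is passed in as a function here for illustration; in this sketch files whose names are not valid block IDs are simply skipped):

// Sketch: parse block file names back into BlockIds; unparseable names are dropped.
import java.io.File
import scala.util.Try
import org.apache.spark.storage.BlockId

def getAllBlocks(getAllFiles: () => Seq[File]): Seq[BlockId] =
  getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)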

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#all-block-files","title":"All Block Files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAllFiles(): Seq[File]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    getAllFiles uses the subDirs registry to list all the files (in all the directories) that are currently stored on disk by this disk manager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#stopping","title":"Stopping
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop(): Unit\n

stop removes the registered shutdown hook (so it is not executed at shutdown) and then doStop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    stop is used when BlockManager is requested to stop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#stopping-diskblockmanager-removing-local-directories-for-blocks","title":"Stopping DiskBlockManager (Removing Local Directories for Blocks)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doStop(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doStop deletes the local directories recursively (only when the constructor's deleteFilesOnStop is enabled and the parent directories are not registered to be removed at shutdown).
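A bare-bones sketch of the recursive cleanup (Spark relies on its Utils.deleteRecursively helper for this):

// Bare-bones sketch of recursively deleting a local block directory.
import java.io.File

def deleteRecursively(file: File): Unit = {
  if (file.isDirectory) {
    Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  file.delete()
}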

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    doStop is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • DiskBlockManager is requested to shut down or stop
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#demo","title":"Demo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Demo: DiskBlockManager and Block Data

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockManager/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.storage.DiskBlockManager logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.storage.DiskBlockManager=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/DiskBlockObjectWriter/","title":"DiskBlockObjectWriter","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockObjectWriter is a disk writer of BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockObjectWriter is an OutputStream (Java) that BlockManager offers for writing data blocks to disk.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    DiskBlockObjectWriter is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BypassMergeSortShuffleWriter is requested for partition writers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • UnsafeSorterSpillWriter is requested for a partition writer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleExternalSorter is requested to writeSortedFile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ExternalSorter is requested to spillMemoryIteratorToDisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/DiskBlockObjectWriter/#creating-instance","title":"Creating Instance","text":"

DiskBlockObjectWriter takes the following to be created:

• File (Java)
• SerializerManager
• SerializerInstance
• Buffer size
• syncWrites flag (based on the spark.shuffle.sync configuration property)
• ShuffleWriteMetricsReporter
• BlockId (default: null)

DiskBlockObjectWriter is created when:

• BlockManager is requested for a disk writer (see the sketch below)
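
A minimal sketch of how a caller obtains a DiskBlockObjectWriter. These are private[spark] internals, so the snippet assumes it is compiled within Spark's own namespace, and the getDiskWriter parameter list is assumed from recent Spark sources (it may differ across releases).

// Sketch only (not the authoritative API): asking BlockManager for a disk writer.
import java.io.File
import org.apache.spark.serializer.SerializerInstance
import org.apache.spark.shuffle.ShuffleWriteMetricsReporter
import org.apache.spark.storage.{BlockId, BlockManager, DiskBlockObjectWriter}

def diskWriterFor(
    blockManager: BlockManager,
    blockId: BlockId,
    file: File,
    serializer: SerializerInstance,
    bufferSize: Int,
    metrics: ShuffleWriteMetricsReporter): DiskBlockObjectWriter =
  blockManager.getDiskWriter(blockId, file, serializer, bufferSize, metrics)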
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/DiskBlockObjectWriter/#buffer-size","title":"Buffer Size

DiskBlockObjectWriter is given a buffer size when created.

The buffer size is specified by BlockManager. In most cases it is based on the spark.shuffle.file.buffer configuration property; in the remaining cases it is hardcoded to 32k, which is also that property's default value.

The buffer size is exactly the buffer size of the BufferedOutputStream.
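
For illustration, the buffer can be tuned through that property when building the Spark configuration (a minimal example; the value uses Spark's byte-size syntax):

import org.apache.spark.SparkConf

// Raise the shuffle write buffer from the 32k default to 1 MB per open writer.
val conf = new SparkConf().set("spark.shuffle.file.buffer", "1m")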

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#serializationstream","title":"SerializationStream

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskBlockObjectWriter manages a SerializationStream for writing a key-value record:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Opens it when requested to open

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Closes it when requested to commitAndGet

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Dereferences it (nulls it) when closeResources

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#states","title":"States

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskBlockObjectWriter can be in one of the following states (that match the state of the underlying output streams):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Initialized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Open
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Closed
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#writing-out-record","title":"Writing Out Record
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      write(\n  key: Any,\n  value: Any): Unit\n

write opens the underlying stream unless it is already open.

write requests the SerializationStream to write the key and then the value.

In the end, write updates the write metrics.
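
A hedged, caller-side sketch of the protocol around write, commitAndGet and close. The writeAll helper and its error handling are illustrative only, not part of Spark, and the types are private[spark] internals.

import org.apache.spark.storage.{DiskBlockObjectWriter, FileSegment}

// Illustrative helper: stream records through a writer and commit them as one FileSegment.
def writeAll(writer: DiskBlockObjectWriter, records: Iterator[(Any, Any)]): FileSegment =
  try {
    records.foreach { case (key, value) => writer.write(key, value) }  // opens the stream lazily
    writer.commitAndGet()  // returns the FileSegment covering what was just written
  } finally {
    writer.close()
  }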

write is used when:

• BypassMergeSortShuffleWriter is requested to write records of a partition
• ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk
• ExternalSorter is requested to write all records into a partitioned file
  • SpillableIterator is requested to spill
• WritablePartitionedPairCollection is requested for a destructiveSortedWritablePartitionedIterator

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#commitandget","title":"commitAndGet
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      commitAndGet(): FileSegment\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      With streamOpen enabled, commitAndGet...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Otherwise, commitAndGet returns a new FileSegment (with the File, committedPosition and 0 length).

commitAndGet is used when:

• BypassMergeSortShuffleWriter is requested to write
• ShuffleExternalSorter is requested to writeSortedFile
• DiskBlockObjectWriter is requested to close
• ExternalAppendOnlyMap is requested to spillMemoryIteratorToDisk
• ExternalSorter is requested to spillMemoryIteratorToDisk, writePartitionedFile
• UnsafeSorterSpillWriter is requested to close
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#committing-writes-and-closing-resources","title":"Committing Writes and Closing Resources
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      close(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Only if initialized, close commitAndGet followed by closeResources. Otherwise, close does nothing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      close is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#revertpartialwritesandclose","title":"revertPartialWritesAndClose
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      revertPartialWritesAndClose(): File\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      revertPartialWritesAndClose...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      revertPartialWritesAndClose is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#writing-bytes-from-byte-array-starting-from-offset","title":"Writing Bytes (From Byte Array Starting From Offset)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      write(\n  kvBytes: Array[Byte],\n  offs: Int,\n  len: Int): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      write...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      write is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#opening-diskblockobjectwriter","title":"Opening DiskBlockObjectWriter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      open(): DiskBlockObjectWriter\n

open opens the DiskBlockObjectWriter, i.e. initializes and resets the bs and objOut internal output streams.

Internally, open makes sure that the DiskBlockObjectWriter is not closed (hasBeenClosed flag is disabled). If it was closed, open throws an IllegalStateException:

Writer already closed. Cannot be reopened.

Unless the DiskBlockObjectWriter has already been initialized (initialized flag is enabled), open initializes it (and turns the initialized flag on).

Regardless of whether the DiskBlockObjectWriter was already initialized or not, open requests the SerializerManager to wrap the mcs output stream for encryption and compression (for the blockId) and sets it as bs.

open requests the SerializerInstance to serialize the bs output stream and sets it as objOut.

Note

open uses the SerializerInstance that was used to create the DiskBlockObjectWriter.

In the end, open turns the streamOpen flag on.

open is used when DiskBlockObjectWriter writes out a record or bytes from a specified byte array and the stream is not open yet.
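
A simplified sketch of the stream wrapping that open performs. The local names bs and objOut mirror the fields described above; wrapStream and serializeStream are the SerializerManager and SerializerInstance methods, and the helper itself is illustrative (these are private[spark] internals).

import java.io.OutputStream
import org.apache.spark.serializer.{SerializationStream, SerializerInstance, SerializerManager}
import org.apache.spark.storage.BlockId

// Sketch only: layering the compressed/encrypted stream (bs) and the
// serialization stream (objOut) on top of an initialized output stream (mcs).
def openStreams(
    serializerManager: SerializerManager,
    serializerInstance: SerializerInstance,
    blockId: BlockId,
    mcs: OutputStream): SerializationStream = {
  val bs: OutputStream = serializerManager.wrapStream(blockId, mcs)  // encryption + compression, if enabled
  serializerInstance.serializeStream(bs)                             // objOut, used by write(key, value)
}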

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#initialization","title":"Initialization
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      initialize(): Unit\n

initialize creates a FileOutputStream to write to the file (with the append flag enabled) and takes the FileChannel associated with this file output stream.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      initialize creates a TimeTrackingOutputStream (with the ShuffleWriteMetricsReporter and the FileOutputStream).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      With checksumEnabled, initialize...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In the end, initialize creates a BufferedOutputStream.
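The stream stack that initialize builds can be pictured with plain java.io types. The sketch below is a simplification and the names are illustrative: a trivial FilterOutputStream plays the role of TimeTrackingOutputStream, and the checksumEnabled branch (still a FIXME above) is omitted.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream, FilterOutputStream, OutputStream}
import java.nio.channels.FileChannel

object InitializeSketch {
  // Stand-in for TimeTrackingOutputStream: records time spent in write calls.
  class TimingOutputStream(underlying: OutputStream) extends FilterOutputStream(underlying) {
    var writeTimeNanos = 0L
    override def write(b: Int): Unit = {
      val t0 = System.nanoTime(); underlying.write(b); writeTimeNanos += System.nanoTime() - t0
    }
    override def write(b: Array[Byte], off: Int, len: Int): Unit = {
      val t0 = System.nanoTime(); underlying.write(b, off, len); writeTimeNanos += System.nanoTime() - t0
    }
  }

  // initialize-like wiring:
  // appending FileOutputStream -> FileChannel + timing wrapper -> BufferedOutputStream
  def initialize(file: File): (FileChannel, BufferedOutputStream) = {
    val fos = new FileOutputStream(file, true)  // append enabled
    val channel = fos.getChannel                // FileChannel of the file output stream
    val timed = new TimingOutputStream(fos)     // TimeTrackingOutputStream stand-in
    (channel, new BufferedOutputStream(timed))  // in the end, a BufferedOutputStream
  }
}
```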

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#checksumenabled-flag","title":"checksumEnabled Flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskBlockObjectWriter defines checksumEnabled flag to...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      checksumEnabled is false by default and can be enabled using setChecksum.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#setchecksum","title":"setChecksum
setChecksum(
  checksum: Checksum): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      setChecksum...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      setChecksum is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • BypassMergeSortShuffleWriter is requested to write records (with spark.shuffle.checksum.enabled enabled)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleExternalSorter is requested to writeSortedFile (with spark.shuffle.checksum.enabled enabled)
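Both callers above guard the checksum path behind spark.shuffle.checksum.enabled. A hedged fragment showing how that configuration could be set on a SparkConf (the application name is made up; this is a snippet for an application's setup code, not Spark's internals):

```scala
import org.apache.spark.SparkConf

// spark.shuffle.checksum.enabled guards the checksum code path in the shuffle writers above.
val conf = new SparkConf()
  .setAppName("checksum-demo")                   // hypothetical application name
  .set("spark.shuffle.checksum.enabled", "true")
```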
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#recording-bytes-written","title":"Recording Bytes Written
recordWritten(): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recordWritten increases the numRecordsWritten counter.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recordWritten requests the ShuffleWriteMetricsReporter to incRecordsWritten.

recordWritten updates the bytes written metric every 16384 records written (based on the numRecordsWritten counter).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      recordWritten is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleExternalSorter is requested to writeSortedFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DiskBlockObjectWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • UnsafeSorterSpillWriter is requested to write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#updating-bytes-written-metric","title":"Updating Bytes Written Metric
updateBytesWritten(): Unit

updateBytesWritten requests the FileChannel for the current file position (i.e., the number of bytes from the beginning of the file) and requests the ShuffleWriteMetricsReporter to incBytesWritten by the difference between that position and the reportedPosition counter.

In the end, updateBytesWritten sets the reportedPosition counter to the current file position (so subsequent updates report only the newly written bytes).
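Taken together, recordWritten and updateBytesWritten amount to incremental metrics bookkeeping. The sketch below is a simplified model with made-up names: plain Long fields replace ShuffleWriteMetricsReporter, and the file position is supplied as a function (in the real code it comes from the FileChannel).

```scala
// Hedged sketch of the recordWritten / updateBytesWritten bookkeeping.
class WriteMetricsSketch(currentPosition: () => Long) {
  private var numRecordsWritten = 0L   // records written through this writer
  private var reportedPosition = 0L    // file position already reported as bytes written
  var recordsWrittenMetric = 0L        // stands in for ShuffleWriteMetricsReporter.incRecordsWritten
  var bytesWrittenMetric = 0L          // stands in for ShuffleWriteMetricsReporter.incBytesWritten

  def recordWritten(): Unit = {
    numRecordsWritten += 1
    recordsWrittenMetric += 1
    // refresh the bytes written metric every 16384 records
    if (numRecordsWritten % 16384 == 0) updateBytesWritten()
  }

  def updateBytesWritten(): Unit = {
    val pos = currentPosition()                  // FileChannel.position() in the real code
    bytesWrittenMetric += pos - reportedPosition // report only the delta since the last update
    reportedPosition = pos                       // so the next delta is computed correctly
  }
}
```

For example, a writer backed by a FileChannel would create it as `new WriteMetricsSketch(() => channel.position())`.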

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#bufferedoutputstream","title":"BufferedOutputStream
mcs: ManualCloseOutputStream

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskBlockObjectWriter creates a custom BufferedOutputStream (Java) when requested to initialize.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The BufferedOutputStream is closed (and dereferenced) in closeResources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The BufferedOutputStream is used to create the OutputStream when requested to open.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskBlockObjectWriter/#outputstream","title":"OutputStream
bs: OutputStream

DiskBlockObjectWriter creates an OutputStream when requested to open. The OutputStream can additionally be encrypted and compressed (when encryption and compression are enabled, respectively).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The OutputStream is closed (and dereferenced) in closeResources.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The OutputStream is used to create the SerializationStream when requested to open.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The OutputStream is requested for the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Write bytes out in write
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Flush in flush (and commitAndGet)
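The optional compression and encryption mentioned above are plain OutputStream decorators applied on top of the buffered stream before the SerializationStream is created. The sketch below is only an illustration with made-up names: GZIPOutputStream stands in for the configured compression codec, and encryption is omitted.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, OutputStream}
import java.util.zip.GZIPOutputStream

object WrapStreamDemo extends App {
  // Hedged sketch: compression is just another OutputStream decorator.
  def wrapStream(mcs: OutputStream, compress: Boolean): OutputStream =
    if (compress) new GZIPOutputStream(mcs) else mcs

  // Data flows: write -> GZIP -> buffer -> in-memory sink (a file in the real code).
  val sink = new ByteArrayOutputStream()
  val bs = wrapStream(new BufferedOutputStream(sink), compress = true)
  bs.write("some record bytes".getBytes("UTF-8"))
  bs.close() // flushes the GZIP and buffered layers down to the sink
  println(s"compressed size: ${sink.size()} bytes")
}
```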
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/DiskStore/","title":"DiskStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskStore manages data blocks on disk for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/DiskStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      DiskStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DiskBlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SecurityManager

DiskStore is created for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/DiskStore/#block-sizes","title":"Block Sizes
blockSizes: ConcurrentHashMap[BlockId, Long]

DiskStore uses a ConcurrentHashMap (Java) as a registry of blocks and their sizes on disk (in bytes).

A new entry is added in put and moveFileToBlock.

An entry is removed in remove.
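That life cycle can be sketched with a plain ConcurrentHashMap (String stands in for BlockId; the class and method names below are made up for illustration). The lookup mirrors getSize, covered later on this page.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hedged sketch of the blockSizes registry.
class BlockSizesSketch {
  private val blockSizes = new ConcurrentHashMap[String, Long]()

  // put / moveFileToBlock add a new entry with the on-disk size of the block
  def recordSize(blockId: String, sizeOnDisk: Long): Unit =
    blockSizes.put(blockId, sizeOnDisk)

  // getSize-like lookup (0 if the block is unknown, in this sketch)
  def getSize(blockId: String): Long =
    blockSizes.getOrDefault(blockId, 0L)

  // remove drops the entry together with the block file
  def removeBlock(blockId: String): Unit =
    blockSizes.remove(blockId)
}
```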

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#putbytes","title":"putBytes
putBytes(
  blockId: BlockId,
  bytes: ChunkedByteBuffer): Unit

putBytes puts the block (using put) and writes the given buffer out to the block file's channel.
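A hedged sketch of the write-out with java.nio types (a single ByteBuffer instead of Spark's ChunkedByteBuffer; the object and method names are illustrative):

```scala
import java.io.{File, FileOutputStream}
import java.nio.ByteBuffer

object PutBytesSketch {
  // Write a buffer fully to the block file through its channel
  // (ChunkedByteBuffer would loop over its chunks in the same way).
  def writeBufferToFile(blockFile: File, bytes: ByteBuffer): Unit = {
    val channel = new FileOutputStream(blockFile).getChannel
    try {
      while (bytes.hasRemaining) { // "write fully": keep writing until the buffer is drained
        channel.write(bytes)
      }
    } finally channel.close()
  }
}
```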

putBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ByteBufferBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#getbytes","title":"getBytes
getBytes(
  blockId: BlockId): BlockData
getBytes(
  f: File,
  blockSize: Long): BlockData

getBytes requests the DiskBlockManager for the block file and looks up the block size (getSize).

getBytes requests the SecurityManager for the IO encryption key (getIOEncryptionKey) and returns an EncryptedBlockData if the key is available or a DiskBlockData otherwise.
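The decision is a simple Option match. A sketch with simplified stand-ins for the two BlockData flavours (the *Sketch types below are made up for illustration):

```scala
import java.io.File

// Simplified stand-ins for Spark's BlockData hierarchy (illustration only).
sealed trait BlockDataSketch
case class DiskBlockDataSketch(file: File, blockSize: Long) extends BlockDataSketch
case class EncryptedBlockDataSketch(file: File, blockSize: Long, key: Array[Byte]) extends BlockDataSketch

object GetBytesSketch {
  // getBytes-like decision: encrypted wrapper when an IO encryption key is configured
  def getBytes(file: File, blockSize: Long, ioEncryptionKey: Option[Array[Byte]]): BlockDataSketch =
    ioEncryptionKey match {
      case Some(key) => EncryptedBlockDataSketch(file, blockSize, key)
      case None      => DiskBlockDataSketch(file, blockSize)
    }
}
```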

getBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TempFileBasedBlockStoreUpdater is requested to blockData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getLocalValues, doGetLocalBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#getsize","title":"getSize
getSize(
  blockId: BlockId): Long

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getSize looks up the block in the blockSizes registry.

getSize is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getStatus, getCurrentBlockStatus, doPutIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskStore is requested for the block bytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#movefiletoblock","title":"moveFileToBlock
moveFileToBlock(
  sourceFile: File,
  blockSize: Long,
  targetBlockId: BlockId): Unit

moveFileToBlock registers the given blockSize for the targetBlockId (in the blockSizes registry), requests the DiskBlockManager for the block file of the targetBlockId and moves the sourceFile to that location.

moveFileToBlock is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TempFileBasedBlockStoreUpdater is requested to saveToDiskStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#checking-if-block-file-exists","title":"Checking if Block File Exists
contains(
  blockId: BlockId): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        contains requests the DiskBlockManager for the block file and checks whether the file actually exists or not.

contains is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to getStatus, getCurrentBlockStatus, getLocalValues, doGetLocalBytes, dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskStore is requested to put
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#persisting-block-to-disk","title":"Persisting Block to Disk
put(
  blockId: BlockId)(
  writeFunc: WritableByteChannel => Unit): Unit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        put prints out the following DEBUG message to the logs:

Attempting to put block [blockId]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        put requests the DiskBlockManager for the block file for the input BlockId.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        put opens the block file for writing (wrapped into a CountingWritableChannel to count the bytes written). put executes the given writeFunc function (with the WritableByteChannel of the block file) and saves the bytes written (to the blockSizes registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In the end, put prints out the following DEBUG message to the logs:

Block [fileName] stored as [size] file on disk in [time] ms

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        In case of any exception, put deletes the block file.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        put throws an IllegalStateException when the block is already stored:

Block [blockId] is already present in the disk store
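
A minimal, self-contained sketch of this open-write-count-record flow (not the actual Spark sources; BlockId, CountingChannel, blockSizes and PutFlowSketch below are simplified stand-ins for the DiskStore internals):

import java.io.{File, FileOutputStream}
import java.nio.ByteBuffer
import java.nio.channels.{Channels, WritableByteChannel}
import scala.collection.concurrent.TrieMap

// Simplified stand-in for org.apache.spark.storage.BlockId
final case class BlockId(name: String)

object PutFlowSketch {

  // Stand-in for the blockSizes registry (block sizes by BlockId)
  private val blockSizes = TrieMap.empty[BlockId, Long]

  // Counting wrapper comparable to a CountingWritableChannel: delegates writes
  // and remembers how many bytes went through.
  final class CountingChannel(delegate: WritableByteChannel) extends WritableByteChannel {
    private var written = 0L
    def count: Long = written
    override def write(src: ByteBuffer): Int = {
      val n = delegate.write(src)
      written += n
      n
    }
    override def isOpen: Boolean = delegate.isOpen
    override def close(): Unit = delegate.close()
  }

  // The put-style flow: open the block file, run writeFunc against the channel,
  // record the number of bytes written, and delete the file if writing fails.
  def put(blockId: BlockId, file: File)(writeFunc: WritableByteChannel => Unit): Unit = {
    val channel = new CountingChannel(Channels.newChannel(new FileOutputStream(file)))
    var failed = true
    try {
      writeFunc(channel)
      blockSizes.put(blockId, channel.count)
      failed = false
    } finally {
      channel.close()
      if (failed) file.delete()
    }
  }

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("block", ".bin")
    put(BlockId("rdd_0_0"), file) { ch =>
      ch.write(ByteBuffer.wrap("hello".getBytes("UTF-8")))
    }
    println(s"stored ${blockSizes(BlockId("rdd_0_0"))} bytes")  // stored 5 bytes
  }
}

The counting wrapper is what lets put record the exact number of bytes produced by writeFunc without re-reading the file afterwards.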

put is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to doPutIterator and dropFromMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskStore is requested to putBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#removing-block","title":"Removing Block
remove(
  blockId: BlockId): Boolean

remove removes the block from the blockSizes registry and requests the DiskBlockManager for the block file. If the file exists, remove deletes it and returns whether the deletion was successful (printing out a WARN message to the logs when it was not). Otherwise, remove returns false.

remove is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is requested to removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • DiskStore is requested to put (and an IOException is thrown)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/DiskStore/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Enable ALL logging level for org.apache.spark.storage.DiskStore logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.storage.DiskStore=ALL

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/ExternalBlockStoreClient/","title":"ExternalBlockStoreClient","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExternalBlockStoreClient is a BlockStoreClient that the driver and executors use when spark.shuffle.service.enabled configuration property is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/ExternalBlockStoreClient/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ExternalBlockStoreClient takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SecretKeyHolder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • authEnabled flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • registrationTimeoutMs

ExternalBlockStoreClient is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkEnv utility is requested to create a SparkEnv (for the driver and executors) with spark.shuffle.service.enabled configuration property enabled
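
For illustration only, a minimal SparkConf sketch showing the property that turns this path on (ExternalShuffleConfSketch and the app name are hypothetical, and a running external shuffle service on the cluster nodes is assumed):

import org.apache.spark.SparkConf

object ExternalShuffleConfSketch {
  def main(args: Array[String]): Unit = {
    // loadDefaults = false keeps the demo independent of spark-defaults.conf
    val conf = new SparkConf(false)
      .setAppName("external-shuffle-demo")
      .set("spark.shuffle.service.enabled", "true")
    println(conf.get("spark.shuffle.service.enabled"))  // true
  }
}
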
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/FallbackStorage/","title":"FallbackStorage","text":"

FallbackStorage is a fallback block storage used for shuffle blocks of decommissioned executors (based on the spark.storage.decommission.fallbackStorage.path configuration property).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/MemoryStore/","title":"MemoryStore","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          MemoryStore manages blocks of data in memory for BlockManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/MemoryStore/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          MemoryStore takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockInfoManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SerializerManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • MemoryManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockEvictionHandler

MemoryStore is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"storage/MemoryStore/#blocks","title":"Blocks
entries: LinkedHashMap[BlockId, MemoryEntry[_]]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryStore creates a LinkedHashMap (Java) of blocks (as MemoryEntries per BlockId) when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            entries uses access-order ordering mode where the order of iteration is the order in which the entries were last accessed (from least-recently accessed to most-recently). That gives LRU cache behaviour when MemoryStore is requested to evict blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryEntries are added in putBytes and putIterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryEntries are removed in remove, clear, and while evicting blocks to free up memory.
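
A standalone sketch of the access-order behaviour the registry relies on (AccessOrderDemo and the String keys are simplifications; the real registry holds MemoryEntries keyed by BlockId, and the same (initialCapacity, loadFactor, accessOrder) constructor shape is assumed):

import java.util.LinkedHashMap

object AccessOrderDemo {
  def main(args: Array[String]): Unit = {
    // accessOrder = true: iteration goes from least- to most-recently accessed
    val entries = new LinkedHashMap[String, Long](32, 0.75f, true)
    entries.put("rdd_0_0", 10L)
    entries.put("rdd_0_1", 20L)
    entries.put("rdd_0_2", 30L)

    entries.get("rdd_0_0")  // touching a block moves it to the most-recently-used end

    val it = entries.keySet().iterator()
    while (it.hasNext) println(it.next())
    // rdd_0_1, rdd_0_2, rdd_0_0 -- eviction would consider rdd_0_1 first
  }
}

Iterating from the head therefore yields the least-recently-used blocks first, which is the order block eviction wants to consider candidates in.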

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#deserializedmemoryentry","title":"DeserializedMemoryEntry

DeserializedMemoryEntry is a MemoryEntry for block values with the following:

• Array[T] (for the values)
• size
• ON_HEAP memory mode

SerializedMemoryEntry

SerializedMemoryEntry is a MemoryEntry for block bytes with the following:

• ChunkedByteBuffer (for the serialized values)
• size
• MemoryMode
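The actual entry classes are private to Spark's storage internals; the following is only a rough, self-contained sketch of the shape described above (MemoryMode stands in for Spark's memory-mode enum and a plain byte array stands in for ChunkedByteBuffer):

// Approximation only, not Spark's definitions
sealed trait MemoryMode
case object ON_HEAP extends MemoryMode
case object OFF_HEAP extends MemoryMode

// Common shape of an in-memory block entry: an estimated size and a memory mode
sealed trait MemoryEntry[T] {
  def size: Long
  def memoryMode: MemoryMode
}

// Block values kept as deserialized objects; always on-heap
case class DeserializedMemoryEntry[T](
    value: Array[T],
    size: Long) extends MemoryEntry[T] {
  val memoryMode: MemoryMode = ON_HEAP
}

// Block kept as serialized bytes; on- or off-heap
case class SerializedMemoryEntry[T](
    buffer: Array[Byte],          // stands in for ChunkedByteBuffer
    size: Long,
    memoryMode: MemoryMode) extends MemoryEntry[T]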
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#sparkstorageunrollmemorythreshold","title":"spark.storage.unrollMemoryThreshold

MemoryStore uses the spark.storage.unrollMemoryThreshold configuration property when requested for the following:

• putIterator
• putIteratorAsBytes
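For example, the property could be raised when individual blocks are known to be large. The snippet below assumes the default is 1 MB (worth verifying against your Spark version); SparkConf.set is the standard way to set it:

import org.apache.spark.SparkConf

// Example only: start unrolling each block with a 4 MB reservation instead of
// the (assumed) 1 MB default
val conf = new SparkConf()
  .set("spark.storage.unrollMemoryThreshold", (4L * 1024 * 1024).toString)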
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#evicting-blocks","title":"Evicting Blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            evictBlocksToFreeSpace(\n  blockId: Option[BlockId],\n  space: Long,\n  memoryMode: MemoryMode): Long\n

evictBlocksToFreeSpace scans the entries registry in least-recently-accessed order and selects blocks to evict until the requested amount of space has been found or there are no more blocks.

Once done, evictBlocksToFreeSpace returns the amount of memory freed up.
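A simplified, self-contained sketch of that selection loop follows (hypothetical helper; the locking and other bookkeeping of the real method are omitted):

import scala.collection.mutable.ArrayBuffer

// Walk (blockId, size) pairs in least-recently-accessed order and keep
// selecting candidates until enough space has been found
def selectBlocksToEvict(
    entriesInLruOrder: Iterator[(String, Long)],
    space: Long): (Seq[String], Long) = {
  val selected = ArrayBuffer.empty[String]
  var freedMemory = 0L
  while (freedMemory < space && entriesInLruOrder.hasNext) {
    val (blockId, size) = entriesInLruOrder.next()
    selected += blockId
    freedMemory += size
  }
  (selected.toSeq, freedMemory)
}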

When there are enough blocks to drop to free up memory, evictBlocksToFreeSpace prints out the following INFO message to the logs:

[n] blocks selected for dropping ([freedMemory] bytes)

evictBlocksToFreeSpace drops the blocks one by one and then prints out the following INFO message to the logs:

After dropping [n] blocks, free memory is [memory]

When there are not enough blocks to drop to make room for the given block (if any), evictBlocksToFreeSpace prints out the following INFO message to the logs:

Will not store [blockId]

evictBlocksToFreeSpace is used when:

• StorageMemoryPool is requested to acquire memory and to free up space to shrink the pool

Dropping Block
dropBlock[T](
  blockId: BlockId,
  entry: MemoryEntry[T]): Unit

dropBlock requests the BlockEvictionHandler to drop the block from memory.

If the block is no longer available in any other store, dropBlock requests the BlockInfoManager to remove the block (info).
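A hedged sketch of that decision follows; the two function parameters are hypothetical stand-ins for the BlockEvictionHandler and BlockInfoManager calls described above:

// dropFromMemory is assumed to report whether the block is still available in
// some other store (e.g. on disk) after being dropped from memory
def dropBlockSketch(
    blockId: String,
    dropFromMemory: String => Boolean,   // true = block still available elsewhere
    removeBlockInfo: String => Unit): Unit = {
  val stillAvailableElsewhere = dropFromMemory(blockId)
  if (!stillAvailableElsewhere) {
    // the block is gone from every store, so its metadata can go too
    removeBlockInfo(blockId)
  }
}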

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#blockinfomanager","title":"BlockInfoManager

MemoryStore is given a BlockInfoManager when created.

MemoryStore uses the BlockInfoManager when requested to evict blocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#accessing-memorystore","title":"Accessing MemoryStore

MemoryStore is available to other Spark services using BlockManager.memoryStore.

import org.apache.spark.SparkEnv
SparkEnv.get.blockManager.memoryStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#serialized-block-bytes","title":"Serialized Block Bytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getBytes(\n  blockId: BlockId): Option[ChunkedByteBuffer]\n

getBytes returns the bytes of the SerializedMemoryEntry of a block (if found in the entries registry).

getBytes is used (for blocks with a serialized, in-memory storage level) when:

• BlockManager is requested for the serialized bytes of a block (from a local block manager), to getLocalValues and to maybeCacheDiskBytesInMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#fetching-deserialized-block-values","title":"Fetching Deserialized Block Values
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            getValues(\n  blockId: BlockId): Option[Iterator[_]]\n

getValues returns the values of the DeserializedMemoryEntry of the given block (if available in the entries registry).

getValues is used (for blocks with a deserialized, in-memory storage level) when:

• BlockManager is requested to getLocalValues and maybeCacheDiskValuesInMemory
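As an illustration only (reusing the sketch MemoryEntry case classes defined earlier, not Spark's own code), the type of entry found in the registry is what decides whether getBytes or getValues can succeed:

// The entry type found for a block decides which accessor returns Some(...)
def getBytesSketch[T](entry: Option[MemoryEntry[T]]): Option[Array[Byte]] =
  entry match {
    case Some(SerializedMemoryEntry(buffer, _, _)) => Some(buffer)
    case _ => None   // missing, or stored deserialized
  }

def getValuesSketch[T](entry: Option[MemoryEntry[T]]): Option[Iterator[T]] =
  entry match {
    case Some(DeserializedMemoryEntry(values, _)) => Some(values.iterator)
    case _ => None   // missing, or stored serialized
  }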
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#putiteratorasbytes","title":"putIteratorAsBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putIteratorAsBytes[T](\n  blockId: BlockId,\n  values: Iterator[T],\n  classTag: ClassTag[T],\n  memoryMode: MemoryMode): Either[PartiallySerializedBlock[T], Long]\n

putIteratorAsBytes requires that the block is not already stored.

putIteratorAsBytes delegates to putIterator (with the given BlockId, the values, the MemoryMode and a new SerializedValuesHolder).

If successful, putIteratorAsBytes returns the estimated size of the block. Otherwise, it returns a PartiallySerializedBlock.

putIteratorAsBytes prints out the following WARN message to the logs when the initial memory threshold is too large:

Initial memory threshold of [initialMemoryThreshold] is too large to be set as chunk size.
Chunk size has been capped to "MAX_ROUNDED_ARRAY_LENGTH"

putIteratorAsBytes is used when:

• BlockManager is requested to doPutIterator (for a block with a serialized, in-memory StorageLevel)

putIteratorAsValues
putIteratorAsValues[T](
  blockId: BlockId,
  values: Iterator[T],
  memoryMode: MemoryMode,
  classTag: ClassTag[T]): Either[PartiallyUnrolledIterator[T], Long]

putIteratorAsValues delegates to putIterator (with the given BlockId, the values, the MemoryMode and a new DeserializedValuesHolder).

If successful, putIteratorAsValues returns the estimated size of the block. Otherwise, it returns a PartiallyUnrolledIterator.
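A small, generic sketch of how a caller might interpret that Either (the helper is hypothetical; Right carries the stored size, Left carries the handle to the partially unrolled values):

// Hypothetical helper: Right means the block was fully stored in memory,
// Left hands back whatever was partially unrolled so the caller can fall back
// (for example, to writing the block to disk)
def handlePutResult[P](result: Either[P, Long])(fallback: P => Unit): Option[Long] =
  result match {
    case Right(estimatedSize) =>
      Some(estimatedSize)            // fully stored; size in bytes
    case Left(partiallyUnrolled) =>
      fallback(partiallyUnrolled)    // e.g. spill the remaining values elsewhere
      None
  }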

putIteratorAsValues is used when:

• BlockStoreUpdater is requested to saveDeserializedValuesToMemoryStore
• BlockManager is requested to doPutIterator and maybeCacheDiskValuesInMemory

putIterator
putIterator[T](
  blockId: BlockId,
  values: Iterator[T],
  classTag: ClassTag[T],
  memoryMode: MemoryMode,
  valuesHolder: ValuesHolder[T]): Either[Long, Long]

putIterator returns the (estimated) size of the block (as Right) or the unrollMemoryUsedByThisBlock (as Left).

putIterator requires that the block is not already in the MemoryStore.

putIterator first reserves unroll memory for the task (reserveUnrollMemoryForThisTask) with spark.storage.unrollMemoryThreshold as the initial memory threshold.

If putIterator does not manage to reserve the memory for unrolling (computing the block in memory), it prints out the following WARN message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Failed to reserve initial memory threshold of [initialMemoryThreshold]\nfor computing block [blockId] in memory.\n

putIterator requests the ValuesHolder to storeValue for every value in the given values iterator. putIterator checks the memory usage periodically (whether it may have exceeded the current threshold) and reserveUnrollMemoryForThisTask (for more unroll memory) when needed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putIterator requests the ValuesHolder for a MemoryEntryBuilder (getBuilder) that in turn is requested to build a MemoryEntry.

putIterator then releaseUnrollMemoryForThisTask (releasing the unroll memory reserved for the block).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putIterator requests the MemoryManager to acquireStorageMemory and stores the block (in the entries registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, putIterator prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Block [blockId] stored as values in memory (estimated size [size], free [free])\n

If putIterator does not have enough memory to store the block, it logUnrollFailureMessage and returns the unrollMemoryUsedByThisBlock (as a Left).
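The control flow above can be summarized with a simplified, self-contained sketch. This is not Spark's actual implementation: ToyMemoryManager, ToyValuesHolder, the per-element size estimate and the growth policy are made up for illustration, and a plain Map stands in for the entries registry.

import scala.collection.mutable

// Made-up stand-ins for Spark's internal MemoryManager and ValuesHolder,
// used only to illustrate the control flow described above.
final class ToyMemoryManager(var free: Long) {
  def acquireUnrollMemory(bytes: Long): Boolean =
    if (bytes <= free) { free -= bytes; true } else false
  def releaseUnrollMemory(bytes: Long): Unit = free += bytes
  def acquireStorageMemory(bytes: Long): Boolean =
    if (bytes <= free) { free -= bytes; true } else false
  def releaseStorageMemory(bytes: Long): Unit = free += bytes
}

final class ToyValuesHolder[T] {
  private val values = mutable.ArrayBuffer.empty[T]
  def storeValue(v: T): Unit = values += v
  def estimatedSize: Long = values.length * 16L  // crude per-element estimate
  def buildEntry(): (Vector[T], Long) = (values.toVector, estimatedSize)
}

// Mirrors the documented contract: Right(estimated size) when the block fits,
// Left(unroll memory used by this block) when it does not.
def putIteratorSketch[T](
    blockId: String,
    values: Iterator[T],
    mm: ToyMemoryManager,
    entries: mutable.Map[String, Vector[T]],
    initialThreshold: Long = 1024L,  // cf. spark.storage.unrollMemoryThreshold
    checkPeriod: Int = 16): Either[Long, Long] = {
  require(!entries.contains(blockId), s"Block $blockId is already present")
  val holder = new ToyValuesHolder[T]
  var threshold = initialThreshold
  var unrollMemoryUsed = 0L
  var keepUnrolling = mm.acquireUnrollMemory(initialThreshold)
  if (keepUnrolling) unrollMemoryUsed += initialThreshold
  else println(s"WARN Failed to reserve initial memory threshold of $initialThreshold " +
    s"for computing block $blockId in memory.")
  var count = 0
  while (values.hasNext && keepUnrolling) {
    holder.storeValue(values.next())
    count += 1
    // periodically re-estimate the size and reserve more unroll memory if needed
    if (count % checkPeriod == 0 && holder.estimatedSize >= threshold) {
      val extra = holder.estimatedSize - threshold + initialThreshold
      keepUnrolling = mm.acquireUnrollMemory(extra)
      if (keepUnrolling) { unrollMemoryUsed += extra; threshold += extra }
    }
  }
  if (keepUnrolling) {
    val (entry, size) = holder.buildEntry()
    mm.releaseUnrollMemory(unrollMemoryUsed)  // give the unroll (scratch) memory back...
    if (mm.acquireStorageMemory(size)) {      // ...and claim storage memory for the block
      entries(blockId) = entry
      println(s"INFO Block $blockId stored as values in memory (estimated size $size, free ${mm.free})")
      Right(size)
    } else Left(unrollMemoryUsed)
  } else {
    println(s"WARN Not enough space to cache $blockId in memory! " +
      s"(computed ${holder.estimatedSize} so far)")
    Left(unrollMemoryUsed)
  }
}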

putIterator is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested to putIteratorAsValues and putIteratorAsBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#logunrollfailuremessage","title":"logUnrollFailureMessage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            logUnrollFailureMessage(\n  blockId: BlockId,\n  finalVectorSize: Long): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            logUnrollFailureMessage prints out the following WARN message to the logs and logMemoryUsage.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Not enough space to cache [blockId] in memory! (computed [size] so far)\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#logmemoryusage","title":"logMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            logMemoryUsage(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            logMemoryUsage prints out the following INFO message to the logs (with the blocksMemoryUsed, currentUnrollMemory, numTasksUnrolling, memoryUsed, and maxMemory):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Memory use = [blocksMemoryUsed] (blocks) + [currentUnrollMemory]\n(scratch space shared across [numTasksUnrolling] tasks(s)) = [memoryUsed].\nStorage limit = [maxMemory].\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#storing-block","title":"Storing Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putBytes[T: ClassTag](\n  blockId: BlockId,\n  size: Long,\n  memoryMode: MemoryMode,\n  _bytes: () => ChunkedByteBuffer): Boolean\n

putBytes returns true only when there was enough memory to store the block (BlockId) in the entries registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putBytes asserts that the block is not stored yet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putBytes requests the MemoryManager for memory (to store the block) and, when successful, adds the block to the entries registry (as a SerializedMemoryEntry with the _bytes and the MemoryMode).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, putBytes prints out the following INFO message to the logs:

Block [blockId] stored as bytes in memory (estimated size [size], free [free])\n
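A minimal sketch of this flow, reusing the made-up ToyMemoryManager from the putIterator sketch above; an Array[Byte] and the bytes thunk stand in for Spark's ChunkedByteBuffer (an illustration, not the real putBytes).

import scala.collection.mutable

// Sketch only: the block's bytes are materialized only once memory is acquired.
def putBytesSketch(
    blockId: String,
    size: Long,
    bytes: () => Array[Byte],
    mm: ToyMemoryManager,
    entries: mutable.Map[String, Array[Byte]]): Boolean = {
  require(!entries.contains(blockId), s"Block $blockId is already present")
  val enoughMemory = mm.acquireStorageMemory(size)
  if (enoughMemory) {
    entries(blockId) = bytes()  // store the serialized block in the entries registry
    println(s"INFO Block $blockId stored as bytes in memory (estimated size $size, free ${mm.free})")
  }
  enoughMemory
}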

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            putBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockStoreUpdater is requested to save serialized values (to MemoryStore)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to maybeCacheDiskBytesInMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#memory-used-for-caching-blocks","title":"Memory Used for Caching Blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            blocksMemoryUsed: Long\n

blocksMemoryUsed is the total storage memory in use (memoryUsed) minus the memory used for unrolling (currentUnrollMemory).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            blocksMemoryUsed is used for logging purposes (when MemoryStore is requested to putBytes, putIterator, remove, evictBlocksToFreeSpace and logMemoryUsage).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#total-storage-memory-in-use","title":"Total Storage Memory in Use
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            memoryUsed: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            memoryUsed requests the MemoryManager for the total storage memory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            memoryUsed is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested for blocksMemoryUsed and to logMemoryUsage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#maximum-storage-memory","title":"Maximum Storage Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            maxMemory: Long\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            maxMemory is the total amount of memory available for storage (in bytes) and is the sum of the maxOnHeapStorageMemory and maxOffHeapStorageMemory of the MemoryManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Tip

Enable INFO logging for MemoryStore to have the maxMemory printed out to the logs when MemoryStore is created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MemoryStore started with capacity [maxMemory] MB\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            maxMemory is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested for the blocksMemoryUsed and to logMemoryUsage
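Putting the three metrics above together, their relationship can be illustrated with a small sketch. The case class and its field names are made up; they merely mirror what blocksMemoryUsed, memoryUsed and maxMemory are documented to return.

// Made-up figures and field names, only to show how the three metrics relate.
final case class ToyStorageMemoryMetrics(
    onHeapStorageMemoryUsed: Long,
    offHeapStorageMemoryUsed: Long,
    maxOnHeapStorageMemory: Long,
    maxOffHeapStorageMemory: Long,
    currentUnrollMemory: Long) {

  // Total Storage Memory in Use (memoryUsed), as reported by the MemoryManager
  def memoryUsed: Long = onHeapStorageMemoryUsed + offHeapStorageMemoryUsed

  // Memory Used for Caching Blocks (blocksMemoryUsed):
  // total storage memory in use minus the unroll (scratch) memory
  def blocksMemoryUsed: Long = memoryUsed - currentUnrollMemory

  // Maximum Storage Memory (maxMemory): sum of the on-heap and off-heap limits
  def maxMemory: Long = maxOnHeapStorageMemory + maxOffHeapStorageMemory
}

val m = ToyStorageMemoryMetrics(512L, 0L, 1024L, 0L, 128L)
assert(m.blocksMemoryUsed == 384L && m.maxMemory == 1024L)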
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#dropping-block-from-memory","title":"Dropping Block from Memory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            remove(\n  blockId: BlockId): Boolean\n

remove returns true when the given block (BlockId) was found and removed from the entries registry, and the memory was successfully released (back to the MemoryManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            remove removes (drops) the block (BlockId) from the entries registry.

If found and removed, remove requests the MemoryManager to releaseStorageMemory and prints out the following DEBUG message to the logs (with the free memory being the maxMemory less the blocksMemoryUsed):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Block [blockId] of size [size] dropped from memory (free [memory])\n
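A hedged sketch of this flow, again with the made-up ToyMemoryManager and a plain map in place of the entries registry (not the real remove, which works on MemoryEntry values).

import scala.collection.mutable

// Sketch only: remove the block and hand its storage memory back.
def removeSketch(
    blockId: String,
    mm: ToyMemoryManager,
    entries: mutable.Map[String, Array[Byte]]): Boolean = {
  entries.remove(blockId) match {
    case Some(entry) =>
      val size = entry.length.toLong
      mm.releaseStorageMemory(size)  // release the block's storage memory
      println(s"DEBUG Block $blockId of size $size dropped from memory (free ${mm.free})")
      true
    case None =>
      false  // the block was not found; nothing to release
  }
}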

remove is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • BlockManager is requested to dropFromMemory and removeBlockInternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#releasing-unroll-memory-for-task","title":"Releasing Unroll Memory for Task
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            releaseUnrollMemoryForThisTask(\n  memoryMode: MemoryMode,\n  memory: Long = Long.MaxValue): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            releaseUnrollMemoryForThisTask finds the task attempt ID of the current task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            releaseUnrollMemoryForThisTask uses the onHeapUnrollMemoryMap or offHeapUnrollMemoryMap based on the given MemoryMode.

(Only when the unroll memory map contains the task attempt ID) releaseUnrollMemoryForThisTask decreases the memory registered in the unroll memory map by the given memory amount and requests the MemoryManager to releaseUnrollMemory. In the end, releaseUnrollMemoryForThisTask removes the task attempt ID (entry) from the unroll memory map if the memory used is 0.
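A simplified sketch of this bookkeeping: a single unroll-memory map keyed by the task attempt ID stands in for the on-heap/off-heap pair, and ToyMemoryManager is the made-up stand-in from the earlier sketches.

import scala.collection.mutable

// Sketch only: release (part of) the unroll memory reserved by one task.
def releaseUnrollMemoryForTaskSketch(
    taskAttemptId: Long,
    mm: ToyMemoryManager,
    unrollMemoryMap: mutable.Map[Long, Long],
    memory: Long = Long.MaxValue): Unit = {
  unrollMemoryMap.get(taskAttemptId).foreach { reserved =>
    val toRelease = math.min(memory, reserved)  // never release more than was reserved
    unrollMemoryMap(taskAttemptId) = reserved - toRelease
    if (toRelease > 0) mm.releaseUnrollMemory(toRelease)
    if (unrollMemoryMap(taskAttemptId) == 0L) {
      unrollMemoryMap.remove(taskAttemptId)     // drop the entry once nothing is left
    }
  }
}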

releaseUnrollMemoryForThisTask is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Task is requested to run (and is about to finish)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • MemoryStore is requested to putIterator
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PartiallyUnrolledIterator is requested to releaseUnrollMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PartiallySerializedBlock is requested to discard and finishWritingToStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"storage/MemoryStore/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.storage.memory.MemoryStore logger to see what happens inside.

Add the following line to conf/log4j.properties:

```text
log4j.logger.org.apache.spark.storage.memory.MemoryStore=ALL
```

Refer to Logging.

# NettyBlockRpcServer

NettyBlockRpcServer is a RpcHandler to handle messages for NettyBlockTransferService.

## Creating Instance

NettyBlockRpcServer takes the following to be created:

• Application ID
• Serializer
• BlockDataManager

NettyBlockRpcServer is created when:

• NettyBlockTransferService is requested to initialize

## OneForOneStreamManager

NettyBlockRpcServer uses a OneForOneStreamManager.

## Receiving RPC Messages

```scala
receive(
  client: TransportClient,
  rpcMessage: ByteBuffer,
  responseContext: RpcResponseCallback): Unit
```

receive deserializes the incoming RPC message (from ByteBuffer to BlockTransferMessage) and prints out the following TRACE message to the logs:

```text
Received request: [message]
```

receive handles the message.

receive is part of the RpcHandler abstraction.
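
The deserialize-log-dispatch flow can be pictured with the following minimal sketch. This is not the NettyBlockRpcServer sources: only BlockTransferMessage.Decoder.fromByteBuffer and the message types covered below are assumed to be on the classpath (a Spark 3.x build), and the handleRpcMessage helper with its empty case bodies is hypothetical.

```scala
import java.nio.ByteBuffer
import org.apache.spark.network.shuffle.protocol.{BlockTransferMessage, FetchShuffleBlocks, OpenBlocks, UploadBlock}

// Hypothetical standalone helper mirroring the flow described above.
def handleRpcMessage(rpcMessage: ByteBuffer): Unit = {
  // ByteBuffer => typed BlockTransferMessage
  val message = BlockTransferMessage.Decoder.fromByteBuffer(rpcMessage)
  println(s"Received request: $message") // logged at TRACE level by the real server
  message match {
    case _: OpenBlocks =>         // register the blocks as a stream, answer with a StreamHandle
    case _: FetchShuffleBlocks => // resolve the shuffle blocks, answer with a StreamHandle
    case _: UploadBlock =>        // decode the metadata, store the block via the BlockDataManager
    case other => sys.error(s"Unsupported message: $other")
  }
}
```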

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/NettyBlockRpcServer/#fetchshuffleblocks","title":"FetchShuffleBlocks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              FetchShuffleBlocks carries the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Application ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Shuffle ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Map IDs (long[])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Reduce IDs (long[][])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • batchFetchEnabled flag

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              When received, receive...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              receive prints out the following TRACE message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Registered streamId [streamId] with [numBlockIds] buffers\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, receive responds with a StreamHandle (with the streamId and the number of blocks). The response is serialized to a ByteBuffer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              FetchShuffleBlocks is posted when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • OneForOneBlockFetcher is requested to createFetchShuffleBlocksMsgAndBuildBlockIds
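
The Map IDs and Reduce IDs are the per-map grouping of the shuffle blocks being fetched. The following sketch is an illustration only (plain Scala with hypothetical block IDs, not the createFetchShuffleBlocksMsgAndBuildBlockIds code): it shows how shuffle block names of the shuffle_[shuffleId]_[mapId]_[reduceId] form collapse into that structure.

```scala
// Hypothetical block IDs; Spark names shuffle blocks shuffle_[shuffleId]_[mapId]_[reduceId].
val blockIds = Seq("shuffle_5_42_0", "shuffle_5_42_1", "shuffle_5_43_0")

// Parse every block ID into (shuffleId, mapId, reduceId)
val triples = blockIds.map { id =>
  val parts = id.split("_")
  (parts(1).toInt, parts(2).toLong, parts(3).toInt)
}

val shuffleId = triples.head._1
// One entry per map ID with all the reduce IDs requested from that map output
val reduceIdsByMap = triples.groupBy(_._2).map { case (mapId, ts) => mapId -> ts.map(_._3) }

println(shuffleId)       // 5
println(reduceIdsByMap)  // e.g. Map(42 -> List(0, 1), 43 -> List(0))
```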
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/NettyBlockRpcServer/#getlocaldirsforexecutors","title":"GetLocalDirsForExecutors","text":""},{"location":"storage/NettyBlockRpcServer/#openblocks","title":"OpenBlocks

OpenBlocks carries the following:

• Application ID
• Executor ID
• Block IDs

When received, receive...FIXME

receive prints out the following TRACE message in the logs:

```text
Registered streamId [streamId] with [blocksNum] buffers
```

In the end, receive responds with a StreamHandle (with the streamId and the number of blocks). The response is serialized to a ByteBuffer.

OpenBlocks is posted when:

• OneForOneBlockFetcher is requested to start
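
For a feel of the message's shape, here is a rough sketch with made-up application, executor and block IDs; only the OpenBlocks constructor and the BlockTransferMessage ByteBuffer round trip are assumed.

```scala
import org.apache.spark.network.shuffle.protocol.{BlockTransferMessage, OpenBlocks}

// Hypothetical IDs -- for illustration only.
val open = new OpenBlocks(
  "app-20240217185125-0000",
  "0",
  Array("shuffle_0_0_0", "rdd_2_1"))

// The client serializes the message to a ByteBuffer before sending it...
val onTheWire = open.toByteBuffer

// ...and receive decodes it back into a typed BlockTransferMessage on the server side.
val decoded = BlockTransferMessage.Decoder.fromByteBuffer(onTheWire)
assert(decoded.isInstanceOf[OpenBlocks])
```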
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/NettyBlockRpcServer/#uploadblock","title":"UploadBlock

UploadBlock carries the following:

• Application ID
• Executor ID
• Block ID
• Metadata (byte[])
• Block Data (byte[])

When received, receive deserializes the metadata to get the StorageLevel and ClassTag of the block being uploaded.

receive...FIXME

UploadBlock is posted when:

• NettyBlockTransferService is requested to upload a block
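
A minimal sketch of that metadata step, assuming the metadata bytes hold a JavaSerializer-written (StorageLevel, ClassTag) pair as described above; the deserializeMetadata helper is hypothetical.

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: turn the UploadBlock metadata back into (StorageLevel, ClassTag),
// assuming it was serialized with JavaSerializer on the sending side.
def deserializeMetadata(metadata: Array[Byte]): (StorageLevel, ClassTag[_]) = {
  val serializer = new JavaSerializer(new SparkConf())
  serializer.newInstance()
    .deserialize[(StorageLevel, ClassTag[_])](ByteBuffer.wrap(metadata))
}
```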
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"storage/NettyBlockRpcServer/#logging","title":"Logging

Enable ALL logging level for org.apache.spark.network.netty.NettyBlockRpcServer logger to see what happens inside.

Add the following line to conf/log4j.properties:

```text
log4j.logger.org.apache.spark.network.netty.NettyBlockRpcServer=ALL
```

Refer to Logging.

# NettyBlockTransferService

NettyBlockTransferService is a BlockTransferService that uses Netty for uploading and fetching blocks of data.

## Creating Instance

NettyBlockTransferService takes the following to be created:

• SparkConf
• SecurityManager
• Bind Address
• Host Name
• Port
• Number of CPU Cores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Driver RpcEndpointRef

NettyBlockTransferService is created when:

• SparkEnv utility is used to create a SparkEnv (for the driver and executors) and creates a BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"storage/NettyBlockTransferService/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                init(\n  blockDataManager: BlockDataManager): Unit\n

init is part of the BlockTransferService abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                init creates a NettyBlockRpcServer (with the application ID, a JavaSerializer and the given BlockDataManager).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                init creates a TransportContext (with the NettyBlockRpcServer just created) and requests it for a TransportClientFactory.

init creates a TransportServer (createServer).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In the end, init prints out the following INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Server created on [hostName]:[port]\n
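The initialization flow above can be summarized with a condensed sketch (an illustration of the steps described on this page, not the actual Spark sources; field names such as transportConf, serializer, clientBootstraps, serverBootstraps and clientFactory are assumptions):

// A condensed, illustrative sketch of init (assumed field names, error handling omitted)
override def init(blockDataManager: BlockDataManager): Unit = {
  // RPC handler that serves block data to remote clients
  val rpcHandler = new NettyBlockRpcServer(conf.getAppId, serializer, blockDataManager)
  // TransportContext wires the RPC handler into the Netty pipeline
  val transportContext = new TransportContext(transportConf, rpcHandler)
  // client factory used later by fetchBlocks and uploadBlock to open connections
  clientFactory = transportContext.createClientFactory(clientBootstraps)
  // server listening for incoming block requests on the bind address and port
  server = createServer(serverBootstraps)
  logInfo(s"Server created on $hostName:${server.getPort}")
}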
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockTransferService/#fetching-blocks","title":"Fetching Blocks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchBlocks(\n  host: String,\n  port: Int,\n  execId: String,\n  blockIds: Array[String],\n  listener: BlockFetchingListener,\n  tempFileManager: DownloadFileManager): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchBlocks prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Fetch blocks from [host]:[port] (executor id [execId])\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchBlocks requests the TransportConf for the maxIORetries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchBlocks creates a BlockTransferStarter.

When maxIORetries is above zero, fetchBlocks creates a RetryingBlockFetcher (with the BlockTransferStarter, the blockIds and the BlockFetchingListener) and starts it.

Otherwise, fetchBlocks requests the BlockTransferStarter to createAndStart (with the blockIds and the BlockFetchingListener).

In case of any Exception, fetchBlocks prints out the following ERROR message to the logs and notifies the given BlockFetchingListener of the failure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Exception while beginning fetchBlocks\n

fetchBlocks is part of the BlockStoreClient abstraction.
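The retry decision can be sketched as follows (an illustrative outline, not the actual Spark code; the starter callback is shown as RetryingBlockFetcher.BlockFetchStarter, which is how some Spark versions name the BlockTransferStarter, and its body is elided here, see the sections below):

// Illustrative sketch of the retry decision in fetchBlocks
val maxRetries = transportConf.maxIORetries()
val blockTransferStarter = new RetryingBlockFetcher.BlockFetchStarter {
  override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener): Unit = {
    // opens a TransportClient and starts a OneForOneBlockFetcher
    // (see the BlockTransferStarter and IOException sections below)
  }
}
if (maxRetries > 0) {
  // wrap the starter so that failed fetches are retried up to maxRetries times
  new RetryingBlockFetcher(transportConf, blockTransferStarter, blockIds, listener).start()
} else {
  blockTransferStarter.createAndStart(blockIds, listener)
}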

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockTransferService/#blocktransferstarter","title":"BlockTransferStarter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                fetchBlocks creates a BlockTransferStarter. When requested to createAndStart, the BlockTransferStarter requests the TransportClientFactory to create a TransportClient.

createAndStart creates a OneForOneBlockFetcher and requests it to start.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockTransferService/#ioexception","title":"IOException

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In case of an IOException, createAndStart requests the driver RpcEndpointRef to send an IsExecutorAlive message synchronously (with the given execId).

If the driver RpcEndpointRef replies false, createAndStart throws an ExecutorDeadException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The relative remote executor(Id: [execId]),\nwhich maintains the block data to fetch is dead.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Otherwise, createAndStart (re)throws the IOException.
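Putting the BlockTransferStarter and the IOException handling together, createAndStart can be sketched like this (illustrative only, not the actual Spark code; the driverEndPointRef field name is an assumption based on the description above):

// Illustrative sketch of createAndStart with the IOException handling
try {
  val client = clientFactory.createClient(host, port)
  new OneForOneBlockFetcher(
    client, appId, execId, blockIds, listener, transportConf, tempFileManager).start()
} catch {
  case e: IOException =>
    // ask the driver synchronously whether the remote executor is still alive
    val isAlive = driverEndPointRef.askSync[Boolean](IsExecutorAlive(execId))
    if (!isAlive) {
      throw new ExecutorDeadException(
        s"The relative remote executor(Id: $execId), which maintains the block data to fetch is dead.")
    }
    // the executor is alive, so rethrow the original IOException
    throw e
}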

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockTransferService/#uploading-block","title":"Uploading Block
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlock(\n  hostname: String,\n  port: Int,\n  execId: String,\n  blockId: BlockId,\n  blockData: ManagedBuffer,\n  level: StorageLevel,\n  classTag: ClassTag[_]): Future[Unit]\n

uploadBlock is part of the BlockTransferService abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlock creates a TransportClient (with the given hostname and port).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                uploadBlock serializes the given StorageLevel and ClassTag (using a JavaSerializer).

uploadBlock uses a stream to transfer the block when either of the following holds:

1. The size of the block data (ManagedBuffer) is above the spark.network.maxRemoteBlockSizeFetchToMem configuration property
2. The given BlockId is a shuffle block

For a stream transfer, uploadBlock requests the TransportClient to uploadStream. Otherwise, uploadBlock requests the TransportClient to sendRpc an UploadBlock message.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                UploadBlock message is processed by NettyBlockRpcServer.

When the upload succeeds, uploadBlock prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Successfully uploaded block [blockId] [as stream]\n

When the upload fails, uploadBlock prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Error while uploading block [blockId] [as stream]\n
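The whole upload path can be summarized with the sketch below (illustrative only, not the actual Spark code; maxRemoteBlockSizeFetchToMem, metadata, streamHeader, blockBytes and result stand in for values computed earlier in uploadBlock and are assumptions):

// Illustrative sketch of the stream-vs-RPC decision and logging in uploadBlock
val asStream = blockData.size() > maxRemoteBlockSizeFetchToMem || blockId.isShuffle
val callback = new RpcResponseCallback {
  override def onSuccess(response: ByteBuffer): Unit = {
    logTrace(s"Successfully uploaded block $blockId${if (asStream) " as stream" else ""}")
    result.success(())
  }
  override def onFailure(e: Throwable): Unit = {
    logError(s"Error while uploading block $blockId${if (asStream) " as stream" else ""}", e)
    result.failure(e)
  }
}
if (asStream) {
  // large or shuffle block: send the metadata as a stream header and stream the block bytes
  client.uploadStream(new NioManagedBuffer(streamHeader), blockData, callback)
} else {
  // small block: ship metadata and data together in a single UploadBlock RPC message
  client.sendRpc(
    new UploadBlock(appId, execId, blockId.name, metadata, blockBytes).toByteBuffer, callback)
}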
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/NettyBlockTransferService/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Enable ALL logging level for org.apache.spark.network.netty.NettyBlockTransferService logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                log4j.logger.org.apache.spark.network.netty.NettyBlockTransferService=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"storage/OneForOneBlockFetcher/","title":"OneForOneBlockFetcher","text":""},{"location":"storage/OneForOneBlockFetcher/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                OneForOneBlockFetcher takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TransportClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Application ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Executor ID
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Block IDs (String[])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • BlockFetchingListener
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • TransportConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • DownloadFileManager

OneForOneBlockFetcher is created when:

• NettyBlockTransferService is requested to fetch blocks
• ExternalBlockStoreClient is requested to fetch blocks

createFetchShuffleBlocksMsg (storage/OneForOneBlockFetcher/#createfetchshuffleblocksmsg)

FetchShuffleBlocks createFetchShuffleBlocksMsg(
  String appId,
  String execId,
  String[] blockIds)

createFetchShuffleBlocksMsg...FIXME
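
While the details are still a FIXME above, the Scala sketch below shows the kind of grouping such a message needs: block IDs of the form shuffle_<shuffleId>_<mapId>_<reduceId> are grouped by map ID so one message can request many blocks at once. FetchShuffleBlocksMsg and the parsing rules here are simplified assumptions for illustration, not Spark's actual FetchShuffleBlocks class.

// A simplified stand-in for the FetchShuffleBlocks protocol message (assumption).
final case class FetchShuffleBlocksMsg(
    appId: String,
    execId: String,
    shuffleId: Int,
    mapIds: Array[Long],
    reduceIds: Array[Array[Int]])

object CreateFetchShuffleBlocksMsgSketch {
  // Groups shuffle block IDs ("shuffle_<shuffleId>_<mapId>_<reduceId>")
  // by map ID so that a single message can ask for many blocks at once.
  def createFetchShuffleBlocksMsg(
      appId: String,
      execId: String,
      blockIds: Array[String]): FetchShuffleBlocksMsg = {
    val parsed = blockIds.toSeq.map { id =>
      id.split("_") match {
        case Array("shuffle", shuffleId, mapId, reduceId) =>
          (shuffleId.toInt, mapId.toLong, reduceId.toInt)
        case _ =>
          throw new IllegalArgumentException(s"unexpected shuffle block id: $id")
      }
    }
    val shuffleId = parsed.head._1
    require(parsed.forall(_._1 == shuffleId), "all blocks must belong to one shuffle")
    val byMapId = parsed.groupBy(_._2).toSeq.sortBy(_._1)
    FetchShuffleBlocksMsg(
      appId,
      execId,
      shuffleId,
      byMapId.map(_._1).toArray,
      byMapId.map { case (_, blocks) => blocks.map(_._3).toArray }.toArray)
  }

  def main(args: Array[String]): Unit = {
    val msg = createFetchShuffleBlocksMsg(
      "app-1", "exec-1", Array("shuffle_0_10_0", "shuffle_0_10_1", "shuffle_0_11_0"))
    println(msg.mapIds.toSeq)                  // map IDs 10 and 11
    println(msg.reduceIds.map(_.toSeq).toSeq)  // reduce IDs per map ID
  }
}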

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/OneForOneBlockFetcher/#starting","title":"Starting
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  void start()\n

start requests the TransportClient to sendRpc the BlockTransferMessage.

start...FIXME

start is used when:

• ExternalBlockStoreClient is requested to fetchBlocks
• NettyBlockTransferService is requested to fetchBlocks
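
The sketch below illustrates the send-RPC-with-callback shape that start follows. TransportClientStub, RpcCallback and BlockTransferRequest are hypothetical stand-ins for Spark's TransportClient, RpcResponseCallback and BlockTransferMessage, used only to show the interaction, not the real network API.

import java.nio.ByteBuffer

// Hypothetical stand-ins; not Spark's actual network classes.
trait RpcCallback {
  def onSuccess(response: ByteBuffer): Unit
  def onFailure(cause: Throwable): Unit
}

final case class BlockTransferRequest(blockIds: Seq[String]) {
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(blockIds.mkString(",").getBytes("UTF-8"))
}

class TransportClientStub {
  // Pretends to send the RPC and immediately "answers" it.
  def sendRpc(message: ByteBuffer, callback: RpcCallback): Unit =
    callback.onSuccess(ByteBuffer.wrap("accepted".getBytes("UTF-8")))
}

object StartSketch {
  def start(client: TransportClientStub, blockIds: Seq[String]): Unit =
    client.sendRpc(BlockTransferRequest(blockIds).toByteBuffer, new RpcCallback {
      def onSuccess(response: ByteBuffer): Unit =
        // In the real fetcher, the response drives how each block is streamed back.
        println(s"fetch RPC accepted for ${blockIds.size} blocks")
      def onFailure(cause: Throwable): Unit =
        println(s"fetch RPC failed: ${cause.getMessage}")
    })

  def main(args: Array[String]): Unit =
    start(new TransportClientStub, Seq("shuffle_0_0_0", "shuffle_0_1_0"))
}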
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/OneForOneBlockFetcher/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Enable ALL logging level for org.apache.spark.network.shuffle.OneForOneBlockFetcher logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.network.shuffle.OneForOneBlockFetcher=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"storage/RDDInfo/","title":"RDDInfo","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  RDDInfo is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/RandomBlockReplicationPolicy/","title":"RandomBlockReplicationPolicy","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  RandomBlockReplicationPolicy is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/ShuffleBlockFetcherIterator/","title":"ShuffleBlockFetcherIterator","text":"

ShuffleBlockFetcherIterator is an Iterator[(BlockId, InputStream)] (Scala) that fetches shuffle blocks from local or remote BlockManagers (and makes them available as an InputStream).

ShuffleBlockFetcherIterator allows for a synchronous iteration over shuffle blocks so a caller can handle them in a pipelined fashion as they are received.

ShuffleBlockFetcherIterator is exhausted (and can provide no elements) when the number of blocks already processed is at least the total number of blocks to fetch.

ShuffleBlockFetcherIterator throttles the remote fetches to avoid consuming too much memory.
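
A minimal sketch of how a caller could consume such an iterator in a pipelined fashion; the BlockId alias and the in-memory streams below are simplified stand-ins used only for illustration.

import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source

object ShuffleIteratorConsumerSketch {
  type BlockId = String // stand-in for Spark's BlockId class hierarchy

  def main(args: Array[String]): Unit = {
    // Pretend fetched blocks arrive one at a time, the way
    // ShuffleBlockFetcherIterator hands out (BlockId, InputStream) pairs.
    val fetched: Iterator[(BlockId, InputStream)] = Iterator(
      ("shuffle_0_0_0", new ByteArrayInputStream("k1,v1\nk2,v2".getBytes("UTF-8"))),
      ("shuffle_0_1_0", new ByteArrayInputStream("k3,v3".getBytes("UTF-8"))))

    // Pipelined consumption: each block is deserialized as soon as it is
    // handed out, while later fetches may still be in flight.
    val records = fetched.flatMap { case (blockId, in) =>
      try Source.fromInputStream(in, "UTF-8").getLines().map(line => (blockId, line)).toList
      finally in.close()
    }
    records.foreach { case (blockId, record) => println(s"$blockId -> $record") }
  }
}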

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"storage/ShuffleBlockFetcherIterator/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ShuffleBlockFetcherIterator takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • TaskContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockStoreClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • BlockManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Blocks to Fetch by Address (Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Stream Wrapper Function ((BlockId, InputStream) => InputStream)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.reducer.maxSizeInFlight
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.reducer.maxReqsInFlight
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.reducer.maxBlocksInFlightPerAddress
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.network.maxRemoteBlockSizeFetchToMem
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.shuffle.detectCorrupt
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark.shuffle.detectCorrupt.useExtraMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • ShuffleReadMetricsReporter
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • doBatchFetch flag

While being created, ShuffleBlockFetcherIterator initializes itself.

ShuffleBlockFetcherIterator is created when:

• BlockStoreShuffleReader is requested to read combined key-value records for a reduce task

Initializing (storage/ShuffleBlockFetcherIterator/#initializing)

initialize(): Unit

initialize registers a task cleanup and fetches shuffle blocks from remote and local BlockManagers.

Internally, initialize uses the TaskContext to register the ShuffleFetchCompletionListener (for cleanup).

initialize then partitions the blocks to fetch by fetch mode (partitionBlocksByFetchMode).

initialize...FIXME
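
Since the remaining steps are still a FIXME, here is a high-level sketch of the flow just described: register a completion listener, split the blocks into local and remote ones, and start fetching remote blocks within the in-flight limits. All names are simplified stand-ins, not ShuffleBlockFetcherIterator's actual private API.

object InitializeFlowSketch {
  def registerCompletionListener(): Unit =
    println("task completion listener registered (releases buffers when the task ends)")

  // Splits the blocks to fetch into local ones (read from the local BlockManager)
  // and remote ones (turned into network fetch requests).
  def partitionBlocksByFetchMode(
      blocksByAddress: Map[String, Seq[String]],
      localHost: String): (Seq[String], Seq[String]) = {
    val (local, remote) = blocksByAddress.partition { case (address, _) => address == localHost }
    (local.values.flatten.toSeq, remote.values.flatten.toSeq)
  }

  def fetchUpToMaxBytes(remoteBlocks: Seq[String]): Unit =
    println(s"sending fetch requests for ${remoteBlocks.size} remote blocks (within the in-flight limits)")

  def initialize(blocksByAddress: Map[String, Seq[String]], localHost: String): Unit = {
    registerCompletionListener()
    val (localBlocks, remoteBlocks) = partitionBlocksByFetchMode(blocksByAddress, localHost)
    fetchUpToMaxBytes(remoteBlocks)
    println(s"reading ${localBlocks.size} blocks from the local BlockManager")
  }

  def main(args: Array[String]): Unit =
    initialize(
      Map(
        "host-a:7337" -> Seq("shuffle_0_0_0"),
        "localhost:7337" -> Seq("shuffle_0_1_0")),
      localHost = "localhost:7337")
}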

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#partitionblocksbyfetchmode","title":"partitionBlocksByFetchMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    partitionBlocksByFetchMode(): ArrayBuffer[FetchRequest]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    partitionBlocksByFetchMode...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#collectfetchrequests","title":"collectFetchRequests
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    collectFetchRequests(\n  address: BlockManagerId,\n  blockInfos: Seq[(BlockId, Long, Int)],\n  collectedRemoteRequests: ArrayBuffer[FetchRequest]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    collectFetchRequests...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#createfetchrequests","title":"createFetchRequests
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createFetchRequests(\n  curBlocks: Seq[FetchBlockInfo],\n  address: BlockManagerId,\n  isLast: Boolean,\n  collectedRemoteRequests: ArrayBuffer[FetchRequest]): Seq[FetchBlockInfo]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    createFetchRequests...FIXME
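
A sketch of the grouping idea behind createFetchRequests: blocks destined for one remote address are packed into FetchRequests capped by a target request size and a maximum number of blocks per request. The types, the cap values, and the packing details are assumptions for illustration only.

object CreateFetchRequestsSketch {
  final case class FetchBlockInfo(blockId: String, size: Long)
  final case class FetchRequest(address: String, blocks: Seq[FetchBlockInfo])

  def createFetchRequests(
      curBlocks: Seq[FetchBlockInfo],
      address: String,
      targetRequestSize: Long,
      maxBlocksPerRequest: Int): Seq[FetchRequest] = {
    val requests = Seq.newBuilder[FetchRequest]
    var batch = Vector.empty[FetchBlockInfo]
    var batchSize = 0L
    curBlocks.foreach { block =>
      batch :+= block
      batchSize += block.size
      // Close the current request once it is big enough or holds enough blocks.
      if (batchSize >= targetRequestSize || batch.size >= maxBlocksPerRequest) {
        requests += FetchRequest(address, batch)
        batch = Vector.empty
        batchSize = 0L
      }
    }
    if (batch.nonEmpty) requests += FetchRequest(address, batch)
    requests.result()
  }

  def main(args: Array[String]): Unit = {
    val blocks = (0 until 6).map(i => FetchBlockInfo(s"shuffle_0_${i}_0", 10L * 1024 * 1024))
    // Target size chosen so that several requests can be in flight at once (assumption).
    createFetchRequests(blocks, "host-a:7337",
      targetRequestSize = 48L * 1024 * 1024 / 5, maxBlocksPerRequest = 1000)
      .foreach(r => println(s"${r.address}: ${r.blocks.map(_.blockId).mkString(", ")}"))
  }
}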

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#fetchuptomaxbytes","title":"fetchUpToMaxBytes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fetchUpToMaxBytes(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fetchUpToMaxBytes...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    fetchUpToMaxBytes is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleBlockFetcherIterator is requested to initialize and next
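
A sketch of the throttling idea behind fetchUpToMaxBytes: queued fetch requests are dispatched only while the in-flight bytes and request counters stay under limits in the spirit of spark.reducer.maxSizeInFlight and spark.reducer.maxReqsInFlight. The queue, the counters, and the values below are simplified stand-ins, not the actual implementation.

import scala.collection.mutable

object FetchUpToMaxBytesSketch {
  final case class FetchRequest(address: String, size: Long)

  val fetchRequests = mutable.Queue(
    FetchRequest("host-a:7337", 20L * 1024 * 1024),
    FetchRequest("host-b:7337", 20L * 1024 * 1024),
    FetchRequest("host-c:7337", 20L * 1024 * 1024))

  val maxBytesInFlight = 48L * 1024 * 1024 // in the spirit of spark.reducer.maxSizeInFlight
  val maxReqsInFlight = Int.MaxValue       // in the spirit of spark.reducer.maxReqsInFlight
  var bytesInFlight = 0L
  var reqsInFlight = 0

  def sendRequest(req: FetchRequest): Unit = {
    bytesInFlight += req.size
    reqsInFlight += 1
    println(s"sent request of ${req.size} bytes to ${req.address}")
  }

  def fetchUpToMaxBytes(): Unit =
    while (fetchRequests.nonEmpty &&
           reqsInFlight < maxReqsInFlight &&
           bytesInFlight + fetchRequests.head.size <= maxBytesInFlight) {
      sendRequest(fetchRequests.dequeue())
    }

  def main(args: Array[String]): Unit = {
    fetchUpToMaxBytes() // dispatches the first two requests (40 MB in flight)
    // The third request stays queued until results arrive and bytesInFlight
    // drops; next() would then trigger fetchUpToMaxBytes() again.
    println(s"still queued: ${fetchRequests.size}")
  }
}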
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#sending-remote-shuffle-block-fetch-request","title":"Sending Remote Shuffle Block Fetch Request
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest(\n  req: FetchRequest): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest prints out the following DEBUG message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Sending request for [n] blocks ([size]) from [hostPort]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest add the size of the blocks in the FetchRequest to bytesInFlight and increments the reqsInFlight internal counters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest requests the ShuffleClient to fetch the blocks with a new BlockFetchingListener (and this ShuffleBlockFetcherIterator when the size of the blocks in the FetchRequest is higher than the maxReqSizeShuffleToMem).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleBlockFetcherIterator is requested to fetch remote shuffle blocks
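The following is a minimal, self-contained Scala sketch of the bookkeeping side of this flow. The types (FetchBlockInfo, FetchRequest), the threshold value, and the println placeholders are simplifying assumptions for illustration only, not Spark's real classes or API:

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

// Hypothetical, simplified stand-ins for Spark's internal types.
final case class FetchBlockInfo(blockId: String, size: Long)
final case class FetchRequest(hostPort: String, blocks: Seq[FetchBlockInfo]) {
  def size: Long = blocks.map(_.size).sum
}

object SendRequestSketch {
  private val bytesInFlight = new AtomicLong(0L)   // bytes of remote blocks currently in flight
  private val reqsInFlight  = new AtomicInteger(0) // remote fetch requests currently in flight
  private val maxReqSizeShuffleToMem = 200L * 1024 * 1024 // assumed fetch-to-disk threshold

  def sendRequest(req: FetchRequest): Unit = {
    println(s"Sending request for ${req.blocks.size} blocks (${req.size} B) from ${req.hostPort}")

    // Bookkeeping is updated before the asynchronous fetch is issued.
    bytesInFlight.addAndGet(req.size)
    reqsInFlight.incrementAndGet()

    // Requests above the threshold would be streamed to disk rather than kept in memory.
    val fetchToDisk = req.size > maxReqSizeShuffleToMem
    println(s"fetchToDisk=$fetchToDisk; now asking the (imaginary) shuffle client to fetch " +
      req.blocks.map(_.blockId).mkString(", "))
  }

  def main(args: Array[String]): Unit =
    sendRequest(FetchRequest("host-1:7337", Seq(FetchBlockInfo("shuffle_0_1_2", 1024L))))
}
```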
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#blockfetchinglistener","title":"BlockFetchingListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    sendRequest creates a new BlockFetchingListener to be notified about successes or failures of shuffle block fetch requests.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#onblockfetchsuccess","title":"onBlockFetchSuccess

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    On onBlockFetchSuccess the BlockFetchingListener adds a SuccessFetchResult to the results registry and prints out the following DEBUG message to the logs (when not a zombie):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    remainingBlocks: [remainingBlocks]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the end, onBlockFetchSuccess prints out the following TRACE message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Got remote block [blockId] after [time]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#onblockfetchfailure","title":"onBlockFetchFailure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    On onBlockFetchFailure the BlockFetchingListener adds a FailureFetchResult to the results registry and prints out the following ERROR message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Failed to get block(s) from [host]:[port]\n
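To make the callback flow concrete, here is a self-contained Scala sketch of a listener whose success and failure callbacks enqueue results into a blocking queue. The trait and case classes below are simplified, hypothetical stand-ins (a plain ByteBuffer instead of a managed buffer), not Spark's real BlockFetchingListener or FetchResult types:

```scala
import java.nio.ByteBuffer
import java.util.concurrent.LinkedBlockingQueue

// Simplified, hypothetical stand-ins for the real Spark types.
sealed trait FetchResult
final case class SuccessFetchResult(blockId: String, data: ByteBuffer) extends FetchResult
final case class FailureFetchResult(blockId: String, error: Throwable) extends FetchResult

trait BlockFetchingListener {
  def onBlockFetchSuccess(blockId: String, data: ByteBuffer): Unit
  def onBlockFetchFailure(blockId: String, error: Throwable): Unit
}

object ListenerSketch {
  val results = new LinkedBlockingQueue[FetchResult]()

  // The listener only enqueues results; the iterator consumes them elsewhere.
  val listener: BlockFetchingListener = new BlockFetchingListener {
    def onBlockFetchSuccess(blockId: String, data: ByteBuffer): Unit = {
      results.put(SuccessFetchResult(blockId, data))
      println(s"Got remote block $blockId")
    }
    def onBlockFetchFailure(blockId: String, error: Throwable): Unit = {
      results.put(FailureFetchResult(blockId, error))
      Console.err.println(s"Failed to get block $blockId: ${error.getMessage}")
    }
  }

  def main(args: Array[String]): Unit = {
    listener.onBlockFetchSuccess("shuffle_0_1_2", ByteBuffer.allocate(16))
    listener.onBlockFetchFailure("shuffle_0_1_3", new RuntimeException("connection reset"))
    println(s"queued results: ${results.size}")
  }
}
```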
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#fetchresults","title":"FetchResults
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    results: LinkedBlockingQueue[FetchResult]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator uses an internal FIFO blocking queue (Java) of FetchResults.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    results is used for fetching the next element.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For remote blocks, FetchResults are added in sendRequest:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SuccessFetchResults after a BlockFetchingListener is notified about onBlockFetchSuccess
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • FailureFetchResults after a BlockFetchingListener is notified about onBlockFetchFailure

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For local blocks, FetchResults are added in fetchLocalBlocks:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SuccessFetchResults after the BlockManager has successfully getLocalBlockData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • FailureFetchResults otherwise

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    For local blocks, FetchResults are added in fetchHostLocalBlock:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SuccessFetchResults after the BlockManager has successfully getHostLocalShuffleData
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • FailureFetchResults otherwise

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    FailureFetchResults can also be added in fetchHostLocalBlocks.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Cleaned up in cleanup
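A tiny runnable Scala example of the FIFO blocking-queue behaviour described above, with the element type simplified to String for brevity: the consumer's take() blocks until a producer thread (standing in for a fetch callback) delivers a result, which is roughly the wait that later shows up as fetch wait time.

```scala
import java.util.concurrent.LinkedBlockingQueue

object ResultsQueueSketch {
  // FIFO blocking queue of fetch results (element type simplified for the example).
  val results = new LinkedBlockingQueue[String]()

  def main(args: Array[String]): Unit = {
    // Producer: simulates a fetch callback arriving later on another thread.
    new Thread(() => {
      Thread.sleep(200)
      results.put("SuccessFetchResult(shuffle_0_1_2)")
    }).start()

    // Consumer: take() blocks until a result is available.
    val start = System.nanoTime()
    val result = results.take()
    val waitedMs = (System.nanoTime() - start) / 1000000
    println(s"got $result after ${waitedMs} ms")
  }
}
```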

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#hasnext","title":"hasNext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    hasNext: Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    hasNext\u00a0is part of the Iterator (Scala) abstraction (to test whether this iterator can provide another element).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    hasNext is true when numBlocksProcessed is below numBlocksToFetch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#retrieving-next-element","title":"Retrieving Next Element
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next(): (BlockId, InputStream)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next increments the numBlocksProcessed registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next takes (and removes) the head of the results queue.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next requests the ShuffleReadMetricsReporter to incFetchWaitTime.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next throws a NoSuchElementException if there is no element left.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    next is part of the Iterator (Scala) abstraction (to produce the next element of this iterator).
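Putting hasNext and next together, here is a heavily simplified, hypothetical iterator sketch (String elements instead of (BlockId, InputStream), no metrics, no deferred requests, no failure handling) that shows the counting and queue-taking contract described above:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical, simplified sketch; not Spark's implementation.
class FetchResultIterator(
    numBlocksToFetch: Int,
    results: LinkedBlockingQueue[String]) extends Iterator[String] {

  private var numBlocksProcessed = 0

  // More elements can be produced as long as fewer than numBlocksToFetch were consumed.
  override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("all blocks processed")
    numBlocksProcessed += 1
    results.take() // blocks until the next fetch result is available
  }
}

object FetchResultIteratorDemo {
  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[String]()
    queue.put("shuffle_0_0_0"); queue.put("shuffle_0_1_0")
    val it = new FetchResultIterator(2, queue)
    while (it.hasNext) println(it.next())
  }
}
```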

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#numblocksprocessed","title":"numBlocksProcessed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The number of blocks fetched and consumed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#numblockstofetch","title":"numBlocksToFetch

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Total number of blocks to fetch and consume

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator can produce up to numBlocksToFetch elements.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    numBlocksToFetch is increased every time ShuffleBlockFetcherIterator is requested to partitionBlocksByFetchMode that prints it out as the INFO message to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Getting [numBlocksToFetch] non-empty blocks out of [totalBlocks] blocks\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#releasecurrentresultbuffer","title":"releaseCurrentResultBuffer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseCurrentResultBuffer(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseCurrentResultBuffer...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    releaseCurrentResultBuffer\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleBlockFetcherIterator is requested to cleanup
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BufferReleasingInputStream is requested to close
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#shufflefetchcompletionlistener","title":"ShuffleFetchCompletionListener

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator creates a ShuffleFetchCompletionListener when created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleFetchCompletionListener is used when initialize and toCompletionIterator.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#cleaning-up","title":"Cleaning Up
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    cleanup(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    cleanup marks this ShuffleBlockFetcherIterator a zombie.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    cleanup releases the current result buffer.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    cleanup iterates over results internal queue and for every SuccessFetchResult, increments remote bytes read and blocks fetched shuffle task metrics, and eventually releases the managed buffer.
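A hedged Scala sketch of that cleanup pattern follows, with hypothetical, simplified types (a release callback instead of a managed buffer, plain counters instead of task metrics): mark the iterator a zombie, then drain the queue, account for every successful result and release its buffer.

```scala
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified stand-ins for the real result and metrics types.
sealed trait Result
final case class Success(blockId: String, bytes: Long, release: () => Unit) extends Result
final case class Failure(blockId: String, error: Throwable) extends Result

object CleanupSketch {
  val results = new LinkedBlockingQueue[Result]()
  @volatile var isZombie = false
  val remoteBytesRead     = new AtomicLong(0L)
  val remoteBlocksFetched = new AtomicLong(0L)

  def cleanup(): Unit = {
    isZombie = true // late callbacks would check this flag and drop their results
    // Drain whatever has already arrived, update the counters and release buffers.
    var result = results.poll()
    while (result != null) {
      result match {
        case Success(_, bytes, release) =>
          remoteBytesRead.addAndGet(bytes)
          remoteBlocksFetched.incrementAndGet()
          release()
        case _: Failure => () // nothing to release on failure
      }
      result = results.poll()
    }
  }

  def main(args: Array[String]): Unit = {
    results.put(Success("shuffle_0_1_2", 1024L, () => println("buffer released")))
    cleanup()
    println(s"bytes=${remoteBytesRead.get}, blocks=${remoteBlocksFetched.get}, zombie=$isZombie")
  }
}
```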

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#bytesinflight","title":"bytesInFlight

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The bytes of fetched remote shuffle blocks in flight

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Starts at 0 when ShuffleBlockFetcherIterator is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Incremented every sendRequest and decremented every next.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator makes sure that the invariant of bytesInFlight is below maxBytesInFlight every remote shuffle block fetch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#reqsinflight","title":"reqsInFlight

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The number of remote shuffle block fetch requests in flight.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Starts at 0 when ShuffleBlockFetcherIterator is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Incremented every sendRequest and decremented every next.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator makes sure that the invariant of reqsInFlight is below maxReqsInFlight every remote shuffle block fetch.
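The interplay of bytesInFlight and reqsInFlight with their limits can be illustrated with the following self-contained Scala sketch. It is not Spark's actual implementation; FetchThrottler, FetchRequest, trySend and onResultConsumed are illustrative names that only mirror the sendRequest / next behaviour described above.

```scala
import scala.collection.mutable

case class FetchRequest(sizeInBytes: Long)

// Gate sending the next remote fetch request on the two in-flight invariants.
class FetchThrottler(maxBytesInFlight: Long, maxReqsInFlight: Int) {
  private var bytesInFlight = 0L
  private var reqsInFlight = 0
  private val deferred = mutable.Queue.empty[FetchRequest]

  /** Send the request only while both invariants hold; otherwise defer it. */
  def trySend(req: FetchRequest)(send: FetchRequest => Unit): Unit =
    if (bytesInFlight + req.sizeInBytes <= maxBytesInFlight &&
        reqsInFlight + 1 <= maxReqsInFlight) {
      bytesInFlight += req.sizeInBytes
      reqsInFlight += 1
      send(req)              // corresponds to sendRequest
    } else {
      deferred.enqueue(req)  // retried later, once earlier results are consumed
    }

  /** Called when a fetch result is fully consumed (corresponds to next). */
  def onResultConsumed(req: FetchRequest): Unit = {
    bytesInFlight -= req.sizeInBytes
    reqsInFlight -= 1
  }
}
```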

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#iszombie","title":"isZombie

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Controls whether ShuffleBlockFetcherIterator is still active and records SuccessFetchResults on successful shuffle block fetches.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Starts false when ShuffleBlockFetcherIterator is created

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enabled (true) in cleanup.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When enabled, registerTempFileToClean is a noop.
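The zombie flag described above follows a common guard pattern that the following self-contained Scala sketch illustrates. It is not Spark's actual code; ZombieAwareFetcher and its members are illustrative names.

```scala
import java.io.File
import scala.collection.mutable

class ZombieAwareFetcher {
  @volatile private var isZombie = false
  private val tempFilesToClean = mutable.Buffer.empty[File]

  /** Becomes a no-op (returns false) once the fetcher has turned into a zombie. */
  def registerTempFileToClean(file: File): Boolean = synchronized {
    if (isZombie) {
      false
    } else {
      tempFilesToClean += file
      true
    }
  }

  /** Marks the fetcher a zombie and deletes any registered temporary files. */
  def cleanup(): Unit = synchronized {
    isZombie = true
    tempFilesToClean.foreach(_.delete())
    tempFilesToClean.clear()
  }
}
```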

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#downloadfilemanager","title":"DownloadFileManager

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleBlockFetcherIterator is a DownloadFileManager.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#throwfetchfailedexception","title":"throwFetchFailedException
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    throwFetchFailedException(\n  blockId: BlockId,\n  mapIndex: Int,\n  address: BlockManagerId,\n  e: Throwable,\n  message: Option[String] = None): Nothing\n

throwFetchFailedException uses the given message (if defined) or falls back to the message of the given Throwable.

In the end, throwFetchFailedException throws a FetchFailedException if the BlockId is either a ShuffleBlockId or a ShuffleBlockBatchId. Otherwise, throwFetchFailedException throws a SparkException:

Failed to get block [blockId], which is not a shuffle block\n

throwFetchFailedException is used when:

• ShuffleBlockFetcherIterator is requested to next
• BufferReleasingInputStream is requested to tryOrFetchFailedException
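The decision between FetchFailedException and SparkException can be sketched as follows. This is a self-contained approximation, not Spark's source: the BlockId hierarchy and both exception classes are simplified stand-ins, and the real FetchFailedException carries more shuffle metadata (such as the BlockManagerId and map index from the signature above).

```scala
// Simplified stand-ins for Spark's BlockId hierarchy and exceptions.
sealed trait BlockId
case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int) extends BlockId
case class ShuffleBlockBatchId(shuffleId: Int, mapId: Long, startReduceId: Int, endReduceId: Int) extends BlockId
case class OtherBlockId(name: String) extends BlockId

class FetchFailedException(message: String, cause: Throwable) extends Exception(message, cause)
class SparkException(message: String, cause: Throwable) extends Exception(message, cause)

object FetchFailureSketch {
  def throwFetchFailedException(
      blockId: BlockId,
      e: Throwable,
      message: Option[String] = None): Nothing = {
    // Use the given message or fall back to the Throwable's message.
    val msg = message.getOrElse(e.getMessage)
    blockId match {
      case _: ShuffleBlockId | _: ShuffleBlockBatchId =>
        throw new FetchFailedException(msg, e)
      case _ =>
        throw new SparkException(
          s"Failed to get block $blockId, which is not a shuffle block", e)
    }
  }
}
```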
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleBlockFetcherIterator/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable ALL logging level for org.apache.spark.storage.ShuffleBlockFetcherIterator logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    log4j.logger.org.apache.spark.storage.ShuffleBlockFetcherIterator=ALL\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Refer to Logging.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"storage/ShuffleFetchCompletionListener/","title":"ShuffleFetchCompletionListener","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleFetchCompletionListener is a TaskCompletionListener (that ShuffleBlockFetcherIterator uses to clean up after the owning task is completed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"storage/ShuffleFetchCompletionListener/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ShuffleFetchCompletionListener takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ShuffleBlockFetcherIterator

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleFetchCompletionListener is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ShuffleBlockFetcherIterator is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"storage/ShuffleFetchCompletionListener/#ontaskcompletion","title":"onTaskCompletion
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      onTaskCompletion(\n  context: TaskContext): Unit\n

onTaskCompletion is part of the TaskCompletionListener abstraction.

onTaskCompletion requests the ShuffleBlockFetcherIterator (if available) to cleanup.

In the end, onTaskCompletion nulls out the reference to the ShuffleBlockFetcherIterator (to make it available for garbage collection).
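A self-contained Scala sketch of the completion behaviour described above (not Spark's actual code; TaskContext, the listener trait and Cleanable are simplified stand-ins):

```scala
trait TaskContext
trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

trait Cleanable {
  def cleanup(): Unit
}

class ShuffleFetchCompletionListenerSketch(data: Cleanable) extends TaskCompletionListener {
  // The iterator to clean up; dropped after the task completes.
  private var iterator: Cleanable = data

  override def onTaskCompletion(context: TaskContext): Unit = {
    if (iterator != null) {
      iterator.cleanup()
    }
    iterator = null // release the reference so it can be garbage collected
  }
}
```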

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"storage/ShuffleMetricsSource/","title":"ShuffleMetricsSource","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      = ShuffleMetricsSource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleMetricsSource is the metrics:spark-metrics-Source.md[metrics source] of a storage:BlockManager.md[] for <>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleMetricsSource lives on a Spark executor and is executor:Executor.md#creating-instance-BlockManager-shuffleMetricsSource[registered only when a Spark application runs in a non-local / cluster mode].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      .Registering ShuffleMetricsSource with \"executor\" MetricsSystem image::ShuffleMetricsSource.png[align=\"center\"]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[creating-instance]] Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ShuffleMetricsSource takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • <>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ShuffleMetricsSource is created when BlockManager is requested for the storage:BlockManager.md#shuffleMetricsSource[shuffle metrics source].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[sourceName]] Source Name

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ShuffleMetricsSource is given a name when <> that is one of the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NettyBlockTransfer when spark.shuffle.service.enabled configuration property is off (false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ExternalShuffle when spark.shuffle.service.enabled configuration property is on (true)
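A self-contained Scala sketch (assuming the Dropwizard Metrics library) of a metrics source whose name follows the spark.shuffle.service.enabled rule above. It is not Spark's actual code; Source is a simplified stand-in for Spark's metrics Source abstraction and ShuffleMetricsSourceSketch is an illustrative name.

```scala
import com.codahale.metrics.{MetricRegistry, MetricSet}

trait Source {
  def sourceName: String
  def metricRegistry: MetricRegistry
}

class ShuffleMetricsSourceSketch(
    override val sourceName: String,
    metricSet: MetricSet) extends Source {
  override val metricRegistry: MetricRegistry = new MetricRegistry
  // Expose the given shuffle metrics under this source.
  metricRegistry.registerAll(metricSet)
}

object ShuffleMetricsSourceSketch {
  /** Chooses the source name from the shuffle-service setting, as described above. */
  def apply(shuffleServiceEnabled: Boolean, metrics: MetricSet): Source = {
    val name = if (shuffleServiceEnabled) "ExternalShuffle" else "NettyBlockTransfer"
    new ShuffleMetricsSourceSketch(name, metrics)
  }
}
```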

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/ShuffleMigrationRunnable/","title":"ShuffleMigrationRunnable","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ShuffleMigrationRunnable is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageLevel/","title":"StorageLevel","text":"

StorageLevel is the following set of flags for controlling the storage of an RDD:

Flag | Default Value
useDisk | false
useMemory | true
useOffHeap | false
deserialized | false
replication | 1
","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#restrictions","title":"Restrictions","text":"
1. The replication is restricted to be less than 40 (for calculating the hash code)
2. Off-heap storage level does not support deserialized storage
","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#validation","title":"Validation
isValid: Boolean\n

StorageLevel is considered valid when all of the following hold:

1. Uses memory or disk
2. Replication is a positive number (between the default of 1 and 40)
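The validity rules and restrictions above can be expressed with the following self-contained Scala sketch (not Spark's actual code; StorageLevelSketch is an illustrative name):

```scala
case class StorageLevelSketch(
    useDisk: Boolean = false,
    useMemory: Boolean = true,
    useOffHeap: Boolean = false,
    deserialized: Boolean = false,
    replication: Int = 1) {

  // Restrictions from the section above.
  require(replication < 40, "Replication restricted to be less than 40")
  require(!(useOffHeap && deserialized),
    "Off-heap storage level does not support deserialized storage")

  // Valid when memory or disk is used and replication is positive.
  def isValid: Boolean = (useMemory || useDisk) && replication > 0
}

object StorageLevelSketchApp extends App {
  println(StorageLevelSketch().isValid)                       // true
  println(StorageLevelSketch(useMemory = false).isValid)      // false: neither memory nor disk
}
```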
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#externalizable","title":"Externalizable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        DirectTaskResult is an Externalizable (Java).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#writeexternal","title":"writeExternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        writeExternal(\n  out: ObjectOutput): Unit\n

writeExternal is part of the Externalizable (Java) abstraction.

writeExternal writes out the bitwise representation of this StorageLevel followed by its replication.
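
Since StorageLevel is Externalizable, a level can be round-tripped through plain Java serialization; a minimal sketch (only the standard java.io API is assumed):

import java.io._\nimport org.apache.spark.storage.StorageLevel\n\n// writeExternal stores the bitwise flags and the replication\nval bytes = new ByteArrayOutputStream()\nval out = new ObjectOutputStream(bytes)\nout.writeObject(StorageLevel.MEMORY_AND_DISK_2)\nout.close()\n\n// readExternal restores an equal StorageLevel from those two values\nval in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))\nassert(in.readObject() == StorageLevel.MEMORY_AND_DISK_2)\n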

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageLevel/#bitwise-integer-representation","title":"Bitwise Integer Representation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        toInt: Int\n

toInt converts this StorageLevel to its numeric (bitwise) representation by turning on the corresponding bit for each of the following flags that is used (from the lowest bit up):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. deserialized
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. useOffHeap
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        3. useMemory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4. useDisk

For example, the following bitwise representation shows a StorageLevel that is deserialized and uses memory (useMemory), i.e. MEMORY_ONLY:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        import org.apache.spark.storage.StorageLevel.MEMORY_ONLY\nassert(MEMORY_ONLY.toInt == (0 | 1 | 4))\n\nscala> println(MEMORY_ONLY.toInt.toBinaryString)\n101\n
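
For comparison, OFF_HEAP keeps serialized data on disk, in memory and off-heap, so the three highest bits are on (a further illustration under the same bit layout):

import org.apache.spark.storage.StorageLevel.OFF_HEAP\nassert(OFF_HEAP.toInt == (8 | 4 | 2))\n\nscala> println(OFF_HEAP.toInt.toBinaryString)\n1110\n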

toInt is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StorageLevel is requested to writeExternal and hashCode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":"","tags":["DeveloperApi"]},{"location":"storage/StorageStatus/","title":"StorageStatus","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        == [[StorageStatus]] StorageStatus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        StorageStatus is a developer API that Spark uses to pass \"just enough\" information about registered storage:BlockManager.md[BlockManagers] in a Spark application between Spark services (mostly for monitoring purposes like spark-webui.md[web UI] or SparkListener.md[]s).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#note","title":"[NOTE]","text":"

There are two ways to access the StorageStatus of all the known BlockManagers in a Spark application (see the listener sketch right after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext.md#getExecutorStorageStatus[SparkContext.getExecutorStorageStatus]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#being-a-sparklistenermd-and-intercepting-sparklistenermdonblockmanageraddedonblockmanageradded-and-sparklistenermdonblockmanagerremovedonblockmanagerremoved-events","title":"* Being a SparkListener.md[] and intercepting SparkListener.md#onBlockManagerAdded[onBlockManagerAdded] and SparkListener.md#onBlockManagerRemoved[onBlockManagerRemoved] events","text":"

StorageStatus is <<creating-instance, created>> when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManagerMasterEndpoint storage:BlockManagerMasterEndpoint.md#storageStatus[is requested for storage status] (of every storage:BlockManager.md[BlockManager] in a Spark application)

[[internal-registries]]
.StorageStatus's Internal Registries and Counters
[cols=\"1,2\",options=\"header\",width=\"100%\"]
|===
| Name | Description

| [[_nonRddBlocks]] _nonRddBlocks
| Lookup table of BlockStatus per BlockId (for non-RDD blocks).

Used when...FIXME

| [[_rddBlocks]] _rddBlocks
| Lookup table of BlockIds and their BlockStatus per RDD id.

Used when...FIXME
|===
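
The shape of these registries can be sketched as follows (a hypothetical model only; the actual fields are private to StorageStatus):

[source, scala]
----
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus}

// Stand-in for _nonRddBlocks: status of non-RDD blocks, keyed by BlockId
val nonRddBlocks = mutable.HashMap.empty[BlockId, BlockStatus]

// Stand-in for _rddBlocks: block statuses keyed by RDD id and then BlockId
val rddBlocks = mutable.HashMap.empty[Int, mutable.HashMap[BlockId, BlockStatus]]
----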

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[updateStorageInfo]] updateStorageInfo Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#source-scala","title":"[source, scala]","text":"

updateStorageInfo(\n  blockId: BlockId,\n  newBlockStatus: BlockStatus): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        updateStorageInfo...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: updateStorageInfo is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[creating-instance]] Creating StorageStatus Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        StorageStatus takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[blockManagerId]] storage:BlockManagerId.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • [[maxMem]] Maximum memory -- storage:BlockManager.md#maxMemory[total available on-heap and off-heap memory for storage on the BlockManager]

StorageStatus initializes the <<internal-registries, internal registries and counters>>.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[rddBlocksById]] Getting RDD Blocks For RDD -- rddBlocksById Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#source-scala_1","title":"[source, scala]","text":""},{"location":"storage/StorageStatus/#rddblocksbyidrddid-int-mapblockid-blockstatus","title":"rddBlocksById(rddId: Int): Map[BlockId, BlockStatus]","text":"

rddBlocksById gives the blocks (BlockIds with their BlockStatus) that belong to the rddId RDD.
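
In terms of the <<_rddBlocks, _rddBlocks>> registry, rddBlocksById amounts to a per-RDD lookup; a hedged sketch over a hypothetical stand-in for that registry:

[source, scala]
----
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus}

// Hypothetical stand-in for the private _rddBlocks registry
val rddBlocks = mutable.HashMap.empty[Int, mutable.HashMap[BlockId, BlockStatus]]

// Conceptually: the RDD's block map, or empty when the RDD has no blocks registered
def rddBlocksById(rddId: Int): Map[BlockId, BlockStatus] =
  rddBlocks.get(rddId).map(_.toMap).getOrElse(Map.empty)
----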

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[removeBlock]] Removing Block (From Internal Registries) -- removeBlock Internal Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#source-scala_2","title":"[source, scala]","text":""},{"location":"storage/StorageStatus/#removeblockblockid-blockid-optionblockstatus","title":"removeBlock(blockId: BlockId): Option[BlockStatus]","text":"

removeBlock removes blockId from the internal registries and returns the removed BlockStatus (if the block was registered).

Internally, removeBlock <<updateStorageInfo, updates the storage info>> of blockId (to be empty, i.e. removed).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        removeBlock branches off per the type of storage:BlockId.md[], i.e. RDDBlockId or not.

For an RDDBlockId, removeBlock finds the RDD in <<_rddBlocks, _rddBlocks>> and removes the blockId. removeBlock removes the RDD entry (from <<_rddBlocks, _rddBlocks>>) completely if there are no more blocks registered for it.

For a non-RDD BlockId, removeBlock removes blockId from the <<_nonRddBlocks, _nonRddBlocks>> registry.
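
Put together, the branching can be modelled as follows (a hypothetical sketch over registry stand-ins; it deliberately leaves out the storage-info update):

[source, scala]
----
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId}

val nonRddBlocks = mutable.HashMap.empty[BlockId, BlockStatus]
val rddBlocks = mutable.HashMap.empty[Int, mutable.HashMap[BlockId, BlockStatus]]

def removeBlock(blockId: BlockId): Option[BlockStatus] = blockId match {
  case RDDBlockId(rddId, _) =>
    // Remove the block from its RDD's map and drop the RDD entry once it is empty
    val removed = rddBlocks.get(rddId).flatMap(_.remove(blockId))
    if (rddBlocks.get(rddId).exists(_.isEmpty)) rddBlocks.remove(rddId)
    removed
  case _ =>
    nonRddBlocks.remove(blockId)
}
----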

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[addBlock]] Registering Status of Data Block -- addBlock Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#source-scala_3","title":"[source, scala]","text":"

addBlock(\n  blockId: BlockId,\n  blockStatus: BlockStatus): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        addBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: addBlock is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        === [[getBlock]] getBlock Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageStatus/#source-scala_4","title":"[source, scala]","text":""},{"location":"storage/StorageStatus/#getblockblockid-blockid-optionblockstatus","title":"getBlock(blockId: BlockId): Option[BlockStatus]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getBlock...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        NOTE: getBlock is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/StorageUtils/","title":"StorageUtils","text":""},{"location":"storage/StorageUtils/#port-of-external-shuffle-service","title":"Port of External Shuffle Service
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        externalShuffleServicePort(\n  conf: SparkConf): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        externalShuffleServicePort...FIXME

externalShuffleServicePort is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManager is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockManagerMasterEndpoint is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"storage/TempFileBasedBlockStoreUpdater/","title":"TempFileBasedBlockStoreUpdater","text":"

TempFileBasedBlockStoreUpdater is a BlockStoreUpdater (that BlockManager uses for storing a block whose data is already available in a local temporary file).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"storage/TempFileBasedBlockStoreUpdater/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        TempFileBasedBlockStoreUpdater takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BlockId
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • StorageLevel
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ClassTag (Scala)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Temporary File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Block Size
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • tellMaster flag (default: true)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • keepReadLock flag (default: false)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          TempFileBasedBlockStoreUpdater is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • BlockManager is requested to putBlockDataAsStream
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • PythonBroadcast is requested to readObject
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"storage/TempFileBasedBlockStoreUpdater/#block-data","title":"Block Data
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          blockData(): BlockData\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          blockData requests the DiskStore (of the parent BlockManager) to getBytes (with the temp file and the block size).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          blockData\u00a0is part of the BlockStoreUpdater abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"storage/TempFileBasedBlockStoreUpdater/#storing-block-to-disk","title":"Storing Block to Disk
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          saveToDiskStore(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          saveToDiskStore requests the DiskStore (of the parent BlockManager) to moveFileToBlock.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          saveToDiskStore\u00a0is part of the BlockStoreUpdater abstraction.
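The Scala sketch below pulls the sections above together: the constructor parameters and the two operations that both delegate to the parent BlockManager's DiskStore. It is an illustration only, with simplified stand-in types for DiskStore and BlockData, the StorageLevel and ClassTag parameters omitted, and method parameter lists that may differ from the real classes.

```scala
import java.io.File

// Stand-in types; the real BlockData and DiskStore live in org.apache.spark.storage
// and expose richer APIs than sketched here.
trait BlockData
trait DiskStore {
  def getBytes(file: File, blockSize: Long): BlockData
  def moveFileToBlock(sourceFile: File, blockSize: Long, blockId: String): Unit
}

// Delegation described above; StorageLevel and the ClassTag are left out.
final case class TempFileBasedUpdaterSketch(
    blockId: String,      // a BlockId in Spark
    tmpFile: File,        // the local temporary file holding the block bytes
    blockSize: Long,
    diskStore: DiskStore, // normally the parent BlockManager's DiskStore
    tellMaster: Boolean = true,
    keepReadLock: Boolean = false) {

  // Block Data: read the block bytes back from the temporary file
  def blockData(): BlockData =
    diskStore.getBytes(tmpFile, blockSize)

  // Storing Block to Disk: promote the temporary file to the block's final location
  def saveToDiskStore(): Unit =
    diskStore.moveFileToBlock(tmpFile, blockSize, blockId)
}
```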

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/","title":"Spark Tools","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Main abstractions:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • AbstractCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/AbstractCommandBuilder/","title":"AbstractCommandBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AbstractCommandBuilder is an abstraction of launch command builders.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/AbstractCommandBuilder/#contract","title":"Contract","text":""},{"location":"tools/AbstractCommandBuilder/#buildCommand","title":"Building Command","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          List<String> buildCommand(\n  Map<String, String> env)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Builds a command to launch a script on command line

See:

• SparkClassCommandBuilder
• SparkSubmitCommandBuilder

Used when:

• Main is requested to build a command
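The buildCommand contract above is easiest to see as a type signature. The real AbstractCommandBuilder is Java code in Spark's launcher module; the Scala sketch below only mirrors its shape, with a placeholder implementation named after one of the builders listed above.

```scala
import java.util.{List => JList, Map => JMap}
import java.util.Collections

// Shape of the contract only; the real class also carries state such as the
// Spark home, the configuration directory and application resources.
abstract class CommandBuilderSketch {
  // Builds a command to launch a script on the command line
  def buildCommand(env: JMap[String, String]): JList[String]
}

// Placeholder: a real builder would assemble the java executable, JVM options,
// class path and main class here.
class SparkSubmitCommandBuilderSketch extends CommandBuilderSketch {
  override def buildCommand(env: JMap[String, String]): JList[String] =
    Collections.emptyList[String]()
}
```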
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/AbstractCommandBuilder/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkClassCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkSubmitCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • WorkerCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/AbstractCommandBuilder/#buildjavacommand","title":"buildJavaCommand
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          List<String> buildJavaCommand(\n  String extraClassPath)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildJavaCommand builds the Java command for a Spark application (which is a collection of elements with the path to java executable, JVM options from java-opts file, and a class path).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If javaHome is set, buildJavaCommand adds [javaHome]/bin/java to the result Java command. Otherwise, it uses JAVA_HOME or, when no earlier checks succeeded, falls through to java.home Java's system property.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          CAUTION: FIXME Who sets javaHome internal property and when?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildJavaCommand loads extra Java options from the java-opts file in configuration directory if the file exists and adds them to the result Java command.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Eventually, buildJavaCommand builds the class path (with the extra class path if non-empty) and adds it as -cp to the result Java command.
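A minimal Scala sketch of that resolution order follows. It is illustrative only: the real buildJavaCommand is Java code in the launcher module, javaHome and confDir stand in for the builder's internal state, buildClassPath is passed in as a function, and the java-opts parsing is simplified.

```scala
import java.io.File
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

def buildJavaCommandSketch(
    javaHome: Option[String],               // the builder's javaHome property (if set)
    confDir: String,                        // Spark's configuration directory
    extraClassPath: String,
    buildClassPath: String => Seq[String]): Seq[String] = {

  // 1. Resolve the java executable: javaHome, then JAVA_HOME, then java.home
  val javaBin = javaHome
    .orElse(sys.env.get("JAVA_HOME"))
    .getOrElse(sys.props("java.home")) +
    File.separator + "bin" + File.separator + "java"

  // 2. Extra JVM options from [confDir]/java-opts, if the file exists (simplified parsing)
  val javaOptsFile = Paths.get(confDir, "java-opts")
  val javaOpts =
    if (Files.isRegularFile(javaOptsFile))
      Files.readAllLines(javaOptsFile).asScala.toSeq
        .flatMap(_.trim.split("\\s+"))
        .filter(_.nonEmpty)
    else Seq.empty[String]

  // 3. Class path (with the extra class path, if non-empty), added after -cp
  val classPath = buildClassPath(extraClassPath).mkString(File.pathSeparator)

  (javaBin +: javaOpts) ++ Seq("-cp", classPath)
}
```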

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractCommandBuilder/#buildclasspath","title":"buildClassPath
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          List<String> buildClassPath(\n  String appClassPath)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildClassPath builds the classpath for a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Directories always end up with the OS-specific file separator at the end of their paths.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildClassPath adds the following in that order:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1. SPARK_CLASSPATH environment variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2. The input appClassPath
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          3. The configuration directory
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          4. (only with SPARK_PREPEND_CLASSES set or SPARK_TESTING being 1) Locally compiled Spark classes in classes, test-classes and Core's jars. + CAUTION: FIXME Elaborate on \"locally compiled Spark classes\".

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          5. (only with SPARK_SQL_TESTING being 1) ... + CAUTION: FIXME Elaborate on the SQL testing case

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          6. HADOOP_CONF_DIR environment variable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          7. YARN_CONF_DIR environment variable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          8. SPARK_DIST_CLASSPATH environment variable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          NOTE: childEnv is queried first before System properties. It is always empty for AbstractCommandBuilder (and SparkSubmitCommandBuilder, too).
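The Scala sketch below illustrates that ordering. It is not the real (Java) implementation: env stands in for the environment lookup, getConfDir for the configuration-directory resolution described below, and the SPARK_PREPEND_CLASSES / SPARK_SQL_TESTING branches are omitted.

```scala
import java.io.File

def buildClassPathSketch(
    appClassPath: String,
    env: Map[String, String],       // typically sys.env
    getConfDir: () => String): Seq[String] = {

  // Directories end up with the OS-specific file separator appended
  def asDir(path: String): String =
    if (path.endsWith(File.separator)) path else path + File.separator

  Seq(
    env.get("SPARK_CLASSPATH"),               // 1. SPARK_CLASSPATH
    Option(appClassPath).filter(_.nonEmpty),  // 2. the input appClassPath
    Some(asDir(getConfDir())),                // 3. the configuration directory
    // 4. and 5. (SPARK_PREPEND_CLASSES / SPARK_SQL_TESTING) omitted in this sketch
    env.get("HADOOP_CONF_DIR").map(asDir),    // 6. HADOOP_CONF_DIR
    env.get("YARN_CONF_DIR").map(asDir),      // 7. YARN_CONF_DIR
    env.get("SPARK_DIST_CLASSPATH")           // 8. SPARK_DIST_CLASSPATH
  ).flatten
}
```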

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractCommandBuilder/#loading-properties-file","title":"Loading Properties File
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Properties loadPropertiesFile()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          loadPropertiesFile loads Spark settings from a properties file (when specified on the command line) or spark-defaults.conf in the configuration directory.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          loadPropertiesFile loads the settings from the following files starting from the first and checking every location until the first properties file is found:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1. propertiesFile (if specified using --properties-file command-line option or set by AbstractCommandBuilder.setPropertiesFile).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2. [SPARK_CONF_DIR]/spark-defaults.conf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          3. [SPARK_HOME]/conf/spark-defaults.conf
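A minimal Scala sketch of that lookup follows (illustrative only; the real code is Java). Locations 2 and 3 collapse into the configuration-directory resolution described in the next section, and propertiesFile / sparkHome stand in for the builder's internal state.

```scala
import java.io.{File, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Properties

def loadPropertiesFileSketch(
    propertiesFile: Option[String],        // --properties-file / setPropertiesFile
    sparkHome: String,                     // as returned by getSparkHome
    env: Map[String, String] = sys.env): Properties = {

  // Spark's configuration directory: SPARK_CONF_DIR, else [SPARK_HOME]/conf
  val confDir = env.getOrElse("SPARK_CONF_DIR", sparkHome + File.separator + "conf")

  // Check the locations in order and take the first existing file
  val candidates = Seq(
    propertiesFile.map(new File(_)),
    Some(new File(confDir, "spark-defaults.conf"))
  ).flatten

  val props = new Properties()
  candidates.find(_.isFile).foreach { file =>
    val reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
    try props.load(reader) finally reader.close()
  }
  props
}
```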
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractCommandBuilder/#sparks-configuration-directory","title":"Spark's Configuration Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AbstractCommandBuilder uses getConfDir to compute the current configuration directory of a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It uses SPARK_CONF_DIR (from childEnv which is always empty anyway or as a environment variable) and falls through to [SPARK_HOME]/conf (with SPARK_HOME from getSparkHome).
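
A rough Scala sketch of that fallback, assuming only environment variables (the real getConfDir is Java and also consults the childEnv map):

```scala
// SPARK_CONF_DIR wins; otherwise the directory is derived from the Spark home directory.
def confDir(sparkHome: => String, env: Map[String, String] = sys.env): String =
  env.getOrElse("SPARK_CONF_DIR", s"$sparkHome/conf")
```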

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractCommandBuilder/#sparks-home-directory","title":"Spark's Home Directory

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AbstractCommandBuilder uses getSparkHome to compute Spark's home directory for a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It uses SPARK_HOME (from childEnv which is always empty anyway or as a environment variable).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If SPARK_HOME is not set, Spark throws a IllegalStateException:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Spark home not found; set it explicitly or use the SPARK_HOME environment variable.\n
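
A minimal Scala sketch of the lookup, assuming only the environment variable (the actual getSparkHome is Java and also checks childEnv):

```scala
// Fails with the documented message when SPARK_HOME cannot be resolved.
def sparkHome(env: Map[String, String] = sys.env): String =
  env.getOrElse("SPARK_HOME", throw new IllegalStateException(
    "Spark home not found; set it explicitly or use the SPARK_HOME environment variable."))
```
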
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractCommandBuilder/#appResource","title":"Application Resource
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          String appResource\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AbstractCommandBuilder uses appResource variable for the name of an application resource.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          appResource can be one of the following application resource names:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Identifier appResource pyspark-shell-main pyspark-shell-main sparkr-shell-main sparkr-shell-main run-example findExamplesAppJar pyspark-shell buildPySparkShellCommand sparkr-shell buildSparkRCommand

appResource can be specified when:

* AbstractLauncher is requested to setAppResource
* SparkSubmitCommandBuilder is created
* SparkSubmitCommandBuilder.OptionParser is requested to handle known or unknown options

appResource is used when:

* SparkLauncher is requested to startApplication (see the usage example below)
* SparkSubmitCommandBuilder is requested to build a command and buildSparkSubmitArgs
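
For example, a launcher-based client sets appResource through SparkLauncher.setAppResource before requesting startApplication. The jar path, main class and master below are made-up placeholders.

```scala
import org.apache.spark.launcher.SparkLauncher

// appResource is the application jar passed to setAppResource (placeholder values).
val handle = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("local[*]")
  .startApplication()

// The returned SparkAppHandle can then be polled for the application state.
println(handle.getState)
```
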
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/AbstractLauncher/","title":"AbstractLauncher","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          AbstractLauncher is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/DependencyUtils/","title":"DependencyUtils Utilities","text":""},{"location":"tools/DependencyUtils/#resolveglobpaths","title":"resolveGlobPaths
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveGlobPaths(\n  paths: String,\n  hadoopConf: Configuration): String\n

resolveGlobPaths...FIXME

resolveGlobPaths is used when:

* SparkSubmit is requested to prepareSubmitEnvironment
* DependencyUtils is used to resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/DependencyUtils/#downloadfile","title":"downloadFile
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFile(\n  path: String,\n  targetDir: File,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFile resolves the path to a well-formed URI and branches off based on the scheme:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • For file and local schemes, downloadFile returns the input path
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • For other schemes, downloadFile...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFile is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkSubmit is requested to prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DependencyUtils is used to downloadFileList
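
A simplified Scala sketch of the scheme branching only; the real downloadFile fetches remote files into targetDir, which is omitted here.

```scala
import java.net.URI

// file/local URIs are returned as-is; anything else would be downloaded (not shown).
def resolveOrDownload(path: String): String = {
  val uri = new URI(path)
  Option(uri.getScheme).getOrElse("file") match {
    case "file" | "local" => path
    case other            => sys.error(s"downloading $other:// paths is not shown in this sketch")
  }
}
```
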
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/DependencyUtils/#downloadfilelist","title":"downloadFileList
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFileList(\n  fileList: String,\n  targetDir: File,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFileList...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          downloadFileList is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkSubmit is requested to prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DependencyUtils is used to resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/DependencyUtils/#resolvemavendependencies","title":"resolveMavenDependencies
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveMavenDependencies(\n  packagesExclusions: String,\n  packages: String,\n  repositories: String,\n  ivyRepoPath: String,\n  ivySettingsPath: Option[String]): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveMavenDependencies...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveMavenDependencies is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkSubmit is requested to prepareSubmitEnvironment (for all resource managers but Spark Standalone and Apache Mesos)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/DependencyUtils/#adding-local-jars-to-classloader","title":"Adding Local Jars to ClassLoader
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          addJarToClasspath(\n  localJar: String,\n  loader: MutableURLClassLoader): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          addJarToClasspath adds file and local jars (as localJar) to the loader classloader.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          addJarToClasspath resolves the URI of localJar. If the URI is file or local and the file denoted by localJar exists, localJar is added to loader. Otherwise, the following warning is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Warning: Local jar /path/to/fake.jar does not exist, skipping.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          For all other URIs, the following warning is printed out to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Warning: Skip remote jar hdfs://fake.jar.\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          addJarToClasspath assumes file URI when localJar has no URI specified, e.g. /path/to/local.jar.
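
The behaviour can be condensed into the following Scala sketch; addUrl stands in for the loader's addURL, and println replaces the actual logging.

```scala
import java.io.File
import java.net.{URI, URL}

// Hedged sketch of addJarToClasspath: only file/local jars that exist are added.
def addJarSketch(localJar: String, addUrl: URL => Unit): Unit = {
  val uri = new URI(localJar)
  Option(uri.getScheme).getOrElse("file") match {
    case "file" | "local" =>
      val file = new File(Option(uri.getPath).getOrElse(localJar))
      if (file.exists) addUrl(file.toURI.toURL)
      else println(s"Warning: Local jar $localJar does not exist, skipping.")
    case _ =>
      println(s"Warning: Skip remote jar $localJar.")
  }
}
```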

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/DependencyUtils/#resolveanddownloadjars","title":"resolveAndDownloadJars
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveAndDownloadJars(\n  jars: String,\n  userJar: String,\n  sparkConf: SparkConf,\n  hadoopConf: Configuration,\n  secMgr: SecurityManager): String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveAndDownloadJars...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          resolveAndDownloadJars is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DriverWrapper is requested to setupDependencies (Spark Standalone cluster mode)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/JavaMainApplication/","title":"JavaMainApplication","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          JavaMainApplication is...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/Main/","title":"Main","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Main\u00a0is the standalone application that is launched from spark-class shell script.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/Main/#main","title":"Launching Application","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          void main(\n  String[] argsArray)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note

main requires that at least the class name (className) is given as the first argument in argsArray.

For the org.apache.spark.deploy.SparkSubmit class name, main creates a SparkSubmitCommandBuilder and uses it to build a command.

For any other class name, main creates a SparkClassCommandBuilder and uses it to build a command.

Class Name | AbstractCommandBuilder
org.apache.spark.deploy.SparkSubmit | SparkSubmitCommandBuilder
anything else | SparkClassCommandBuilder

In the end, main prepares the final command using prepareWindowsCommand or prepareBashCommand, based on the operating system it runs on (MS Windows or non-Windows, respectively).
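The builder selection boils down to a check of the first argument. Here is a rough Scala sketch (the actual launcher code is Java and instantiates the builders, which are internal to the launcher module, so this hypothetical helper only returns the builder's name):

// Hypothetical sketch of Main's builder selection; the real code constructs
// an AbstractCommandBuilder rather than returning its name.
def builderFor(className: String): String =
  if (className == \"org.apache.spark.deploy.SparkSubmit\") \"SparkSubmitCommandBuilder\"
  else \"SparkClassCommandBuilder\"

assert(builderFor(\"org.apache.spark.deploy.SparkSubmit\") == \"SparkSubmitCommandBuilder\")
assert(builderFor(\"org.apache.spark.deploy.master.Master\") == \"SparkClassCommandBuilder\")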

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/Main/#buildCommand","title":"Building Command","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          List<String> buildCommand(\n  AbstractCommandBuilder builder,\n  Map<String, String> env,\n  boolean printLaunchCommand)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          buildCommand requests the given AbstractCommandBuilder to build a command.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          With printLaunchCommand enabled, buildCommand prints out the command to standard error:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Spark Command: [cmd]\n========================================\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SPARK_PRINT_LAUNCH_COMMAND

printLaunchCommand is controlled by the SPARK_PRINT_LAUNCH_COMMAND environment variable.
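As a rough Scala sketch (not the launcher's actual Java code), and assuming that any non-empty value of the variable enables the flag:

// Sketch: derive printLaunchCommand from the environment and, when enabled,
// print the command to standard error in the format shown above.
val printLaunchCommand = sys.env.get(\"SPARK_PRINT_LAUNCH_COMMAND\").exists(_.nonEmpty)

val cmd = Seq(\"java\", \"-cp\", \"<classpath>\", \"org.apache.spark.deploy.SparkSubmit\")  // placeholder command
if (printLaunchCommand) {
  System.err.println(\"Spark Command: \" + cmd.mkString(\" \"))
  System.err.println(\"=\" * 40)
}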

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/SparkApplication/","title":"SparkApplication","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkApplication is an abstraction of entry points to Spark applications that can be started (submitted for execution using spark-submit).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/SparkApplication/#contract","title":"Contract","text":""},{"location":"tools/SparkApplication/#starting-spark-application","title":"Starting Spark Application
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          start(\n  args: Array[String], conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkSubmit is requested to submit an application for execution
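A minimal sketch of an implementation of this contract (the real trait is internal to Spark's deploy package, so the sketch re-declares an equivalent trait locally; MySparkApp is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Local stand-in that mirrors the SparkApplication contract (the real trait
// is not part of the public API).
trait SparkApplicationLike {
  def start(args: Array[String], conf: SparkConf): Unit
}

// Hypothetical entry point: uses the SparkConf handed over by SparkSubmit to
// create a SparkContext and run the application logic.
class MySparkApp extends SparkApplicationLike {
  override def start(args: Array[String], conf: SparkConf): Unit = {
    val sc = new SparkContext(conf)
    try {
      println(s\"Started with ${args.length} argument(s)\")
    } finally {
      sc.stop()
    }
  }
}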
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"tools/SparkApplication/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ClientApp
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JavaMainApplication
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KubernetesClientApplication (Spark on Kubernetes)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • RestSubmissionClientApp
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • YarnClusterApplication
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/SparkClassCommandBuilder/","title":"SparkClassCommandBuilder","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkClassCommandBuilder is an AbstractCommandBuilder.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"tools/SparkClassCommandBuilder/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkClassCommandBuilder takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Class Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Class Arguments (List<String>)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkClassCommandBuilder is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Main standalone application is launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/SparkLauncher/","title":"SparkLauncher","text":"

SparkLauncher is an interface to launch Spark applications programmatically, i.e. from code (rather than using spark-submit directly). It uses a builder pattern to configure a Spark application and launch it as a child process using spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkLauncher uses SparkSubmitCommandBuilder to build the Spark command of a Spark application to launch.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/SparkLauncher/#spark-internal","title":"spark-internal

SparkLauncher defines spark-internal (NO_RESOURCE) as a special value that tells Spark not to process the application resource (primary resource) as a regular file, but as a logical resource that the cluster manager knows how to look up and submit for execution (e.g. Spark on YARN or Spark on Kubernetes).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark-internal special value is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to prepareSubmitEnvironment and checks whether to add the primaryResource as part of the following:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • --jar (for Spark on YARN in cluster deploy mode)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • --primary-* arguments and define the --main-class argument (for Spark on Kubernetes in cluster deploy mode with KubernetesClientApplication main class)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to check whether a resource is internal or not
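For illustration only (a sketch, not taken from the sections above): a client that relies on the cluster manager to resolve the application resource can pass the special value through SparkLauncher.NO_RESOURCE; the master URL and main class below are placeholders.

import org.apache.spark.launcher.SparkLauncher

// Hypothetical submission where no regular application file is shipped and the
// cluster manager is expected to resolve the (internal) primary resource.
val handle = new SparkLauncher()
  .setMaster(\"k8s://https://example.com:6443\")  // placeholder master URL
  .setDeployMode(\"cluster\")
  .setMainClass(\"com.example.MySparkApp\")       // hypothetical main class
  .setAppResource(SparkLauncher.NO_RESOURCE)     // i.e. spark-internal
  .startApplication()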
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/SparkLauncher/#other","title":"Other

SparkLauncher's Builder Methods to Set Up Invocation of Spark Application

Setter | Description
addAppArgs(String... args) | Adds command line arguments for a Spark application.
addFile(String file) | Adds a file to be submitted with a Spark application.
addJar(String jar) | Adds a jar file to be submitted with the application.
addPyFile(String file) | Adds a Python file / zip / egg to be submitted with a Spark application.
addSparkArg(String arg) | Adds a no-value argument to the Spark invocation.
addSparkArg(String name, String value) | Adds an argument with a value to the Spark invocation. Recognizes known command-line arguments, i.e. --master, --properties-file, --conf, --class, --jars, --files, and --py-files.
directory(File dir) | Sets the working directory of spark-submit.
redirectError() | Redirects stderr to stdout.
redirectError(File errFile) | Redirects error output to the specified errFile file.
redirectError(ProcessBuilder.Redirect to) | Redirects error output to the specified to Redirect.
redirectOutput(File outFile) | Redirects output to the specified outFile file.
redirectOutput(ProcessBuilder.Redirect to) | Redirects standard output to the specified to Redirect.
redirectToLog(String loggerName) | Sets all output to be logged and redirected to a logger with the specified name.
setAppName(String appName) | Sets the name of a Spark application.
setAppResource(String resource) | Sets the main application resource, i.e. the location of a jar file for Scala/Java applications.
setConf(String key, String value) | Sets a Spark property. Expects key to start with the spark. prefix.
setDeployMode(String mode) | Sets the deploy mode.
setJavaHome(String javaHome) | Sets a custom JAVA_HOME.
setMainClass(String mainClass) | Sets the main class.
setMaster(String master) | Sets the master URL.
setPropertiesFile(String path) | Sets the internal propertiesFile (see loadPropertiesFile of AbstractCommandBuilder).
setSparkHome(String sparkHome) | Sets a custom SPARK_HOME.
setVerbose(boolean verbose) | Enables verbose reporting for SparkSubmit.

After the invocation of a Spark application is set up, use the launch() method to launch a sub-process that runs the configured Spark application. It is, however, recommended to use the startApplication method instead, which returns a SparkAppHandle to monitor the application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/SparkLauncher/#source-scala","title":"[source, scala]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            import org.apache.spark.launcher.SparkLauncher

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            val command = new SparkLauncher() .setAppResource(\"SparkPi\") .setVerbose(true)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/SparkLauncher/#val-apphandle-commandstartapplication","title":"val appHandle = command.startApplication()","text":""},{"location":"tools/pyspark/","title":"pyspark Shell Script","text":"

The pyspark shell script runs spark-submit with the pyspark-shell-main application resource as the first argument, followed by the --name \"PySparkShell\" option (and any other command-line arguments, if specified).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/pyspark/#pyspark-shell","title":"pyspark/shell.py","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            pyspark/shell.py

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Learn more about pyspark/shell.py in The Internals of PySpark.

The pyspark/shell.py module is launched as a PYTHONSTARTUP script.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/pyspark/#environment-variables","title":"Environment Variables","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            pyspark script exports the following environment variables:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • OLD_PYTHONSTARTUP
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PYSPARK_DRIVER_PYTHON
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PYSPARK_DRIVER_PYTHON_OPTS
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PYSPARK_PYTHON
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PYTHONPATH
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • PYTHONSTARTUP
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/pyspark/#OLD_PYTHONSTARTUP","title":"OLD_PYTHONSTARTUP","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            pyspark defines OLD_PYTHONSTARTUP environment variable to be the initial value of PYTHONSTARTUP (before it gets redefined).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The idea of OLD_PYTHONSTARTUP is to delay execution of the Python startup script until pyspark/shell.py finishes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/pyspark/#PYSPARK_PYTHON","title":"PYSPARK_PYTHON","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PYSPARK_PYTHON environment variable can be used to specify a Python executable to run PySpark scripts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The Internals of PySpark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Learn more about PySpark in The Internals of PySpark.

PYSPARK_PYTHON can be overridden by PYSPARK_DRIVER_PYTHON and configuration properties when SparkSubmitCommandBuilder is requested to buildPySparkShellCommand.

PYSPARK_PYTHON is overridden by the spark.pyspark.python configuration property, if defined, when SparkSubmitCommandBuilder is requested to buildPySparkShellCommand.
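To make the precedence concrete, here is a minimal Scala sketch of one possible resolution order, based only on the description above. The object name, the method signature and the python3 fallback are illustrative assumptions; the actual logic lives in the Java SparkSubmitCommandBuilder.

// Illustration only: a simplified model of the PYSPARK_PYTHON precedence.
object PySparkPythonResolution {
  def resolvePythonExec(
      conf: Map[String, String],
      env: Map[String, String]): String =
    conf.get("spark.pyspark.python")              // configuration property wins, if defined
      .orElse(env.get("PYSPARK_DRIVER_PYTHON"))   // then the driver-specific variable
      .orElse(env.get("PYSPARK_PYTHON"))          // then the general variable
      .getOrElse("python3")                       // assumed fallback

  def main(args: Array[String]): Unit =
    println(resolvePythonExec(conf = Map.empty, env = sys.env))
}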

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/pyspark/#PYTHONSTARTUP","title":"PYTHONSTARTUP","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            From Python Documentation:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            PYTHONSTARTUP

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If this is the name of a readable file, the Python commands in that file are executed before the first prompt is displayed in interactive mode. The file is executed in the same namespace where interactive commands are executed so that objects defined or imported in it can be used without qualification in the interactive session. You can also change the prompts sys.ps1 and sys.ps2 and the hook sys.__interactivehook__ in this file.

pyspark (re)defines the PYTHONSTARTUP environment variable to point at the pyspark/shell.py module:

${SPARK_HOME}/python/pyspark/shell.py

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            OLD_PYTHONSTARTUP

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The initial value of PYTHONSTARTUP environment variable is available as OLD_PYTHONSTARTUP.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-class/","title":"spark-class shell script","text":"

spark-class shell script is the Spark application command-line launcher that is responsible for setting up the JVM environment and executing a Spark application.

NOTE: Ultimately, any shell script in Spark, e.g. link:spark-submit.adoc[spark-submit], calls the spark-class script.

You can find the spark-class script in the bin directory of the Spark distribution.

When started, spark-class first loads $SPARK_HOME/bin/load-spark-env.sh, collects the Spark assembly jars, and executes the org.apache.spark.launcher.Main standalone application.

Depending on whether the RELEASE file exists, i.e. whether you use a packaged Spark distribution or a local build, spark-class sets the SPARK_JARS_DIR environment variable to [SPARK_HOME]/jars or [SPARK_HOME]/assembly/target/scala-[SPARK_SCALA_VERSION]/jars, respectively (the latter being the local build).

If SPARK_JARS_DIR does not exist, spark-class prints the following error message and exits with exit code 1.

Failed to find Spark jars directory ([SPARK_JARS_DIR]).
You need to build Spark with the target "package" before running this program.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark-class sets LAUNCH_CLASSPATH environment variable to include all the jars under SPARK_JARS_DIR.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If SPARK_PREPEND_CLASSES is enabled, [SPARK_HOME]/launcher/target/scala-[SPARK_SCALA_VERSION]/classes directory is added to LAUNCH_CLASSPATH as the first entry.

NOTE: Use SPARK_PREPEND_CLASSES to have the Spark launcher classes (from [SPARK_HOME]/launcher/target/scala-[SPARK_SCALA_VERSION]/classes) appear before the other Spark assembly jars. It is useful for development so that your changes don't require rebuilding Spark.
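The following Scala sketch models the SPARK_JARS_DIR and LAUNCH_CLASSPATH computation described above. The real logic is plain Bash in bin/spark-class, so treat the object name, the checks and the 2.12 Scala version default as simplifying assumptions.

import java.io.File

// Illustration only: a simplified model of the classpath setup in bin/spark-class.
object LaunchClasspathSketch {
  def main(args: Array[String]): Unit = {
    val sparkHome    = sys.env.getOrElse("SPARK_HOME", ".")
    val scalaVersion = sys.env.getOrElse("SPARK_SCALA_VERSION", "2.12") // assumed default

    // A packaged distribution ships a RELEASE file; a local build does not.
    val sparkJarsDir =
      if (new File(sparkHome, "RELEASE").exists()) s"$sparkHome/jars"
      else s"$sparkHome/assembly/target/scala-$scalaVersion/jars"

    if (!new File(sparkJarsDir).isDirectory) {
      Console.err.println(s"Failed to find Spark jars directory ($sparkJarsDir).")
      Console.err.println("You need to build Spark with the target \"package\" before running this program.")
      sys.exit(1)
    }

    // All jars under SPARK_JARS_DIR, optionally preceded by the launcher classes.
    val launchClasspath =
      if (sys.env.contains("SPARK_PREPEND_CLASSES"))
        s"$sparkHome/launcher/target/scala-$scalaVersion/classes" + File.pathSeparator + s"$sparkJarsDir/*"
      else
        s"$sparkJarsDir/*"

    println(s"LAUNCH_CLASSPATH=$launchClasspath")
  }
}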

The SPARK_TESTING and SPARK_SQL_TESTING environment variables enable a special test mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME What's so special about the env vars?

spark-class uses the org.apache.spark.launcher.Main command-line application to compute the Spark command to launch. The Main class programmatically computes the command that spark-class executes afterwards.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TIP: Use JAVA_HOME to point at the JVM to use.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            === [[main]] Launching org.apache.spark.launcher.Main Standalone Application

org.apache.spark.launcher.Main is a standalone Java application used by spark-class to prepare the Spark command to execute.

Main expects that the first parameter is a class name that determines the "operation mode":

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. org.apache.spark.deploy.SparkSubmit -- Main uses link:spark-submit-SparkSubmitCommandBuilder.adoc[SparkSubmitCommandBuilder] to parse command-line arguments. This is the mode link:spark-submit.adoc[spark-submit] uses.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. anything -- Main uses SparkClassCommandBuilder to parse command-line arguments.
$ ./bin/spark-class org.apache.spark.launcher.Main
Exception in thread "main" java.lang.IllegalArgumentException: Not enough arguments: missing class name.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at org.apache.spark.launcher.Main.main(Main.java:51)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Main uses buildCommand method on the builder to build a Spark command.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If SPARK_PRINT_LAUNCH_COMMAND environment variable is enabled, Main prints the final Spark command to standard error.

Spark Command: [cmd]
========================================

On Windows, Main calls prepareWindowsCommand; on non-Windows operating systems, it calls prepareBashCommand with the tokens separated by the null character (\0).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            CAUTION: FIXME What's prepareWindowsCommand? prepareBashCommand?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Main uses the following environment variables:

• SPARK_DAEMON_JAVA_OPTS and SPARK_MASTER_OPTS that are added to the java command line (see the sketch after this list).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SPARK_DAEMON_MEMORY (default: 1g) for -Xms and -Xmx.
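As a rough sketch of the overall flow, the following Scala program mimics what Main produces. The real class is written in Java and delegates to SparkSubmitCommandBuilder or SparkClassCommandBuilder, so the command assembly below is a simplified assumption rather than the actual implementation.

// Illustration only: a simplified model of org.apache.spark.launcher.Main.
object LauncherMainSketch {
  def buildCommand(args: List[String]): List[String] = {
    require(args.nonEmpty, "Not enough arguments: missing class name.")

    val memory = sys.env.getOrElse("SPARK_DAEMON_MEMORY", "1g") // default: 1g
    val javaOpts = Seq("SPARK_DAEMON_JAVA_OPTS", "SPARK_MASTER_OPTS")
      .flatMap(sys.env.get)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)

    // The real Main picks SparkSubmitCommandBuilder or SparkClassCommandBuilder
    // based on args.head; this sketch only assembles a plausible java command.
    List("java", s"-Xms$memory", s"-Xmx$memory") ++
      javaOpts ++
      List("-cp", sys.env.getOrElse("LAUNCH_CLASSPATH", ".")) ++
      args
  }

  def main(args: Array[String]): Unit = {
    val cmd = buildCommand(args.toList)
    if (sys.env.contains("SPARK_PRINT_LAUNCH_COMMAND")) {
      Console.err.println(s"Spark Command: ${cmd.mkString(" ")}")
      Console.err.println("=" * 40)
    }
    // spark-class reads these tokens (separated by the null character) and execs them.
    println(cmd.mkString("\u0000"))
  }
}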
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-shell/","title":"spark-shell shell script","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Spark shell is an interactive environment where you can learn how to make the most out of Apache Spark quickly and conveniently.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            TIP: Spark shell is particularly helpful for fast interactive prototyping.

Under the covers, Spark shell is a standalone Spark application written in Scala that offers an environment with auto-completion (using the TAB key) where you can run ad-hoc queries and get familiar with the features of Spark (which help you develop your own standalone Spark applications). It is a very convenient tool to explore the many things available in Spark with immediate feedback. It is one of the many reasons why spark-overview.md#why-spark[Spark is so helpful for tasks to process datasets of any size].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            There are variants of Spark shell for different languages: spark-shell for Scala, pyspark for Python and sparkR for R.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NOTE: This document (and the book in general) uses spark-shell for Scala only.

You can start Spark shell using the spark-shell script.

$ ./bin/spark-shell
scala>

spark-shell is an extension of the Scala REPL with automatic instantiation of spark-sql-SparkSession.md[SparkSession] as spark (and SparkContext.md[SparkContext] as sc).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-shell/#source-scala","title":"[source, scala]","text":"

scala> :type spark
org.apache.spark.sql.SparkSession

// Learn the current version of Spark in use
scala> spark.version
res0: String = 2.1.0-SNAPSHOT

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark-shell also imports spark-sql-SparkSession.md#implicits[Scala SQL's implicits] and spark-sql-SparkSession.md#sql[sql method].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-shell/#source-scala_1","title":"[source, scala]","text":"

scala> :imports
 1) import spark.implicits._ (59 terms, 38 are implicit)
 2) import spark.sql (1 terms)
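With those imports in place you can, for example, turn a local collection into a Dataset and run SQL directly. The session below is a sketch; the exact output depends on your Spark version.

scala> val ds = Seq(1, 2, 3).toDS()   // toDS comes from spark.implicits._
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> sql("SELECT 1 + 1 AS sum").show()   // sql comes from import spark.sql
+---+
|sum|
+---+
|  2|
+---+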

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-shell/#note","title":"[NOTE]","text":"

When you execute spark-shell, you actually execute spark-submit/index.md[Spark submit] as follows:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-shell/#optionswrap","title":"[options=\"wrap\"]","text":""},{"location":"tools/spark-shell/#orgapachesparkdeploysparksubmit-class-orgapachesparkreplmain-name-spark-shell-spark-shell","title":"org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name Spark shell spark-shell","text":""},{"location":"tools/spark-shell/#set-spark_print_launch_command-to-see-the-entire-command-to-be-executed-refer-to-spark-tips-and-tricksmdspark_print_launch_commandprint-launch-command-of-spark-scripts","title":"Set SPARK_PRINT_LAUNCH_COMMAND to see the entire command to be executed. Refer to spark-tips-and-tricks.md#SPARK_PRINT_LAUNCH_COMMAND[Print Launch Command of Spark Scripts].","text":"

Using Spark shell

You start Spark shell using the spark-shell script (available in the bin directory).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ ./bin/spark-shell\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\nWARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\nWARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException\nSpark context Web UI available at http://10.47.71.138:4040\nSpark context available as 'sc' (master = local[*], app id = local-1477858597347).\nSpark session available as 'spark'.\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.1.0-SNAPSHOT\n      /_/\n\nUsing Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)\nType in expressions to have them evaluated.\nType :help for more information.\n\nscala>\n

Spark shell creates an instance of spark-sql-SparkSession.md[SparkSession] under the name spark for you (so you don't have to know how to create one yourself on day one).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scala> :type spark\norg.apache.spark.sql.SparkSession\n

There is also the sc value, which is an instance of SparkContext.md[SparkContext].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scala> :type sc\norg.apache.spark.SparkContext\n

To close Spark shell, press Ctrl+D or type in :q (or any prefix of :quit, e.g. :qu).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            scala> :q\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/","title":"spark-submit Shell Script","text":"

spark-submit shell script is used to submit Spark applications for execution and to manage them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            spark-submit is a command-line frontend to SparkSubmit.
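
For example, a hypothetical submission of an application jar could look as follows; the class, master URL, and jar name are placeholders:

# submit the hypothetical my-app.jar to a standalone cluster in cluster deploy mode\n$ ./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --class org.example.MyApp my-app.jar\n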

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#options","title":"Command-Line Options","text":""},{"location":"tools/spark-submit/#archives","title":"archives","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Command-Line Option: --archives
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Internal Property: archives
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#deploy-mode","title":"deploy-mode","text":"

Deploy mode of the driver: client (default) or cluster

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Command-Line Option: --deploy-mode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spark Property: spark.submit.deployMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Environment Variable: DEPLOY_MODE
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Internal Property: deployMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#driver-class-path","title":"driver-class-path","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            --driver-class-path\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Extra class path entries (e.g. jars and directories) to pass to a driver's JVM.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            --driver-class-path command-line option sets the extra class path entries (e.g. jars and directories) that should be added to a driver's JVM.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Tip

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Use --driver-class-path in client deploy mode (not SparkConf) to ensure that the CLASSPATH is set up with the entries.

client deploy mode uses the same JVM for the driver as spark-submit's, so the driver's class path has to be set up before that JVM starts (which is too late for SparkConf).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internal Property: driverExtraClassPath

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Spark Property: spark.driver.extraClassPath

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

Command-line options (e.g. --driver-class-path) have higher precedence than their corresponding Spark settings in a Spark properties file (e.g. spark.driver.extraClassPath). You can therefore control the final settings by overriding Spark settings on the command line using the command-line options.
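
For example (the paths and class below are placeholders), the command-line option wins over spark.driver.extraClassPath defined in a properties file:

# conf/spark-defaults.conf contains: spark.driver.extraClassPath /old/libs/*\n$ ./bin/spark-submit --deploy-mode client --driver-class-path /extra/libs/custom.jar --class org.example.MyApp my-app.jar\n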

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#driver-cores","title":"driver-cores","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            --driver-cores NUM\n

--driver-cores command-line option sets the number of cores for the driver to NUM (in cluster deploy mode only).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Spark Property: spark.driver.cores

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

Only available in cluster deploy mode (when the driver is executed outside the spark-submit process).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Internal Property: driverCores
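
For example (cluster deploy mode only; the master URL, class, and jar are placeholders):

$ ./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --driver-cores 2 --class org.example.MyApp my-app.jar\n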

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#properties-file","title":"properties-file","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            --properties-file [FILE]\n

--properties-file command-line option sets the path to a file (FILE) from which Spark loads extra Spark properties.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Spark uses conf/spark-defaults.conf by default.
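
For example, with a custom (hypothetical) properties file used instead of conf/spark-defaults.conf:

# myapp.conf contains e.g. spark.executor.memory 2g\n$ ./bin/spark-submit --properties-file myapp.conf --class org.example.MyApp my-app.jar\n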

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#queue","title":"queue","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            --queue QUEUE_NAME\n

The name of the YARN resource queue to submit the application to (YARN deployments only)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Spark Property: spark.yarn.queue
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Internal Property: queue
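
For example, submitting to a hypothetical YARN queue named analytics:

$ ./bin/spark-submit --master yarn --deploy-mode cluster --queue analytics --class org.example.MyApp my-app.jar\n
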
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#version","title":"version","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Command-Line Option: --version

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ ./bin/spark-submit --version\nWelcome to\n      ____              __\n     / __/__  ___ _____/ /__\n    _\\ \\/ _ \\/ _ `/ __/  '_/\n   /___/ .__/\\_,_/_/ /_/\\_\\   version 2.1.0-SNAPSHOT\n      /_/\n\nBranch master\nCompiled by user jacek on 2016-09-30T07:08:39Z\nRevision 1fad5596885aab8b32d2307c0edecbae50d5bd7a\nUrl https://github.com/apache/spark.git\nType --help for more information.\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/#SPARK_PRINT_LAUNCH_COMMAND","title":"SPARK_PRINT_LAUNCH_COMMAND","text":"

SPARK_PRINT_LAUNCH_COMMAND environment variable makes the Spark scripts print out the complete Spark command to standard output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell\nSpark Command: /Library/Ja...\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/SparkSubmit/","title":"SparkSubmit","text":"

SparkSubmit is the entry point of the spark-submit shell script.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/SparkSubmit/#special-primary-resource-names","title":"Special Primary Resource Names

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkSubmit uses the following special primary resource names to represent Spark shells rather than application jars:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • spark-shell
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • pyspark-shell
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • sparkr-shell
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#pyspark-shell","title":"pyspark-shell

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkSubmit uses pyspark-shell when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to prepareSubmitEnvironment for .py scripts or pyspark, isShell and isPython
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isshell","title":"isShell
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isShell(\n  res: String): Boolean\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isShell is true when the given res primary resource represents a Spark shell.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isShell\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to prepareSubmitEnvironment and isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmitArguments is requested to handleUnknown (and determine a primary application resource)
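For illustration, isShell boils down to a membership test over the special primary resource names listed earlier. A minimal sketch, assuming hypothetical constant names (SparkSubmit's own identifiers may differ):

```scala
// Minimal sketch of the shell check; the constant names are assumptions,
// not necessarily the identifiers used in SparkSubmit itself.
object ShellResources {
  val SparkShell   = "spark-shell"
  val PySparkShell = "pyspark-shell"
  val SparkRShell  = "sparkr-shell"

  // true when the given primary resource is one of the special shell names
  def isShell(res: String): Boolean =
    res == SparkShell || res == PySparkShell || res == SparkRShell
}
```

For example, ShellResources.isShell("pyspark-shell") yields true, while a regular application jar such as "my-app.jar" does not.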
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#actions","title":"Actions

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkSubmit executes actions (based on the action argument).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#killing-submission","title":"Killing Submission
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            kill(\n  args: SparkSubmitArguments): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            kill...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#displaying-version","title":"Displaying Version
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            printVersion(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            printVersion...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#submission-status","title":"Submission Status
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            requestStatus(\n  args: SparkSubmitArguments): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            requestStatus...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#submit","title":"Application Submission
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submit(\n  args: SparkSubmitArguments,\n  uninitLog: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            submit doRunMain unless isStandaloneCluster and useRest.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For isStandaloneCluster with useRest requested, submit...FIXME
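The control flow above can be sketched as follows (names mirror the prose, not the exact SparkSubmit internals; the standalone-cluster REST path is elided, just as the original leaves it as FIXME):

```scala
// Self-contained sketch of submit's branching as described above.
object SubmitSketch {
  // Hypothetical stand-in for the relevant SparkSubmitArguments fields.
  final case class Args(isStandaloneCluster: Boolean, useRest: Boolean)

  private def doRunMain(): Unit =
    () // run the main class of the application (see doRunMain below)

  def submit(args: Args, uninitLog: Boolean): Unit =
    if (args.isStandaloneCluster && args.useRest) {
      // Standalone cluster mode over the REST submission gateway
      // (left as FIXME in the original text).
      ()
    } else {
      doRunMain()
    }
}
```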

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#doRunMain","title":"doRunMain","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doRunMain(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doRunMain runMain unless proxyUser is specified.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            With proxyUser specified, doRunMain...FIXME
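The proxy-user branch typically relies on Hadoop's impersonation API. A sketch under that assumption (the real doRunMain differs in detail and error handling):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Sketch of the usual Hadoop proxy-user pattern; runMain is passed in as a
// function only to keep the example self-contained.
object ProxyUserSketch {
  def doRunMain(proxyUser: Option[String])(runMain: () => Unit): Unit =
    proxyUser match {
      case Some(user) =>
        // Impersonate the proxy user and run the application as that user.
        val ugi = UserGroupInformation.createProxyUser(
          user, UserGroupInformation.getCurrentUser)
        ugi.doAs(new PrivilegedExceptionAction[Unit] {
          override def run(): Unit = runMain()
        })
      case None =>
        runMain()
    }
}
```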

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/SparkSubmit/#runMain","title":"Running Main Class","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runMain(\n  args: SparkSubmitArguments,\n  uninitLog: Boolean): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runMain prepares submit environment for the given SparkSubmitArguments (that gives childArgs, childClasspath, sparkConf and childMainClass).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            With verbose enabled, runMain prints out the following INFO messages to the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Main class:\n[childMainClass]\nArguments:\n[childArgs]\nSpark config:\n[sparkConf_redacted]\nClasspath elements:\n[childClasspath]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runMain creates and sets a context classloader (based on spark.driver.userClassPathFirst configuration property) and adds the jars (from childClasspath).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runMain loads the main class (childMainClass).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            runMain creates a SparkApplication (if the main class is a subtype of) or creates a JavaMainApplication (with the main class).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            In the end, runMain requests the SparkApplication to start (with the childArgs and sparkConf).
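The selection between the two wrappers can be pictured as follows. This is a sketch with stand-in types, since SparkApplication and JavaMainApplication are Spark-internal; the real runMain differs in detail:

```scala
import org.apache.spark.SparkConf

// Illustrative stand-ins for Spark's internal application wrappers.
object AppSelectionSketch {
  trait AppLike { def start(args: Array[String], conf: SparkConf): Unit }

  // Stand-in for JavaMainApplication: invokes the class' static main(String[]).
  final class MainMethodApp(mainClass: Class[_]) extends AppLike {
    def start(args: Array[String], conf: SparkConf): Unit = {
      val main = mainClass.getMethod("main", classOf[Array[String]])
      main.invoke(null, args)
    }
  }

  // Use the class directly if it is an application type,
  // otherwise wrap its main method.
  def instantiate(mainClass: Class[_]): AppLike =
    if (classOf[AppLike].isAssignableFrom(mainClass))
      mainClass.getConstructor().newInstance().asInstanceOf[AppLike]
    else
      new MainMethodApp(mainClass)
}
```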

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/SparkSubmit/#cluster-managers","title":"Cluster Managers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SparkSubmit has a built-in support for some cluster managers (that are selected based on the master argument).
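The selection can be pictured as a prefix match on the master URL, as in this sketch (the function itself is made up; the prefixes mirror the table below):

```scala
// Sketch of mapping a master URL to a cluster-manager "nickname".
object ClusterManagerSketch {
  def clusterManager(master: String): String = master match {
    case m if m.startsWith("k8s://") => "KUBERNETES"
    case m if m.startsWith("local")  => "LOCAL"
    case m if m.startsWith("mesos")  => "MESOS"
    case m if m.startsWith("spark")  => "STANDALONE"
    case "yarn"                      => "YARN"
    case other =>
      throw new IllegalArgumentException(s"Unsupported master URL: $other")
  }
}
```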

| Nickname | Master URL |
|----------|------------|
| KUBERNETES | k8s:// prefix |
| LOCAL | local prefix |
| MESOS | mesos prefix |
| STANDALONE | spark prefix |
| YARN | yarn |

## Launching Standalone Application
```scala
main(
  args: Array[String]): Unit
```

main creates a SparkSubmit and requests it to doSubmit (with the given args).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#doSubmit","title":"doSubmit
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doSubmit(\n  args: Array[String]): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doSubmit initializeLogIfNecessary.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doSubmit parses the arguments in the given args (that gives a SparkSubmitArguments).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            With verbose option on, doSubmit prints out the appArgs to standard output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            doSubmit branches off based on action.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Action Handler SUBMIT submit KILL kill REQUEST_STATUS requestStatus PRINT_VERSION printVersion
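The dispatch above can be sketched as a simple pattern match. SparkSubmitAction is Spark-internal, so it is modeled here with a plain sealed trait:

```scala
// Sketch of doSubmit's action dispatch; handler calls are shown as comments.
object ActionDispatchSketch {
  sealed trait Action
  case object Submit        extends Action
  case object Kill          extends Action
  case object RequestStatus extends Action
  case object PrintVersion  extends Action

  def dispatch(action: Action): Unit = action match {
    case Submit        => () // submit(appArgs, uninitLog)
    case Kill          => () // kill(appArgs)
    case RequestStatus => () // requestStatus(appArgs)
    case PrintVersion  => () // printVersion()
  }
}
```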

doSubmit is used when:

* InProcessSparkSubmit standalone application is started
* SparkSubmit standalone application is started
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#parseArguments","title":"Parsing Arguments
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            parseArguments(\n  args: Array[String]): SparkSubmitArguments\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            parseArguments creates a SparkSubmitArguments (with the given args).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#prepareSubmitEnvironment","title":"prepareSubmitEnvironment
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            prepareSubmitEnvironment(\n  args: SparkSubmitArguments,\n  conf: Option[HadoopConfiguration] = None): (Seq[String], Seq[String], SparkConf, String)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            prepareSubmitEnvironment creates a 4-element tuple made up of the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. childArgs for arguments
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. childClasspath for Classpath elements
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            3. sysProps for Spark properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            4. childMainClass
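For illustration, a caller such as runMain (described above) consumes the result by destructuring the tuple. A minimal, self-contained sketch with dummy values (the stand-in function and its values are made up):

```scala
import org.apache.spark.SparkConf

// Stand-in with the same shape as prepareSubmitEnvironment's result,
// used only to show how the tuple is consumed.
object PrepareEnvSketch {
  def prepareSubmitEnvironmentSketch(): (Seq[String], Seq[String], SparkConf, String) =
    (Seq.empty, Seq.empty, new SparkConf(), "org.example.Main") // hypothetical main class

  val (childArgs, childClasspath, sparkConf, childMainClass) =
    prepareSubmitEnvironmentSketch()
}
```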

Tip

Use the --verbose command-line option to have the elements of the tuple printed out to the standard output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            prepareSubmitEnvironment...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            For isPython in CLIENT deploy mode, prepareSubmitEnvironment sets the following based on primaryResource:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • For pyspark-shell the mainClass is org.apache.spark.api.python.PythonGatewayServer

• Otherwise, the mainClass is org.apache.spark.deploy.PythonRunner, with the main Python file, the extra Python files and the childArgs as the arguments (see the sketch below)
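The selection can be pictured as a simple branch on the primary resource. The helper below is a minimal sketch under that reading (the function name is made up for illustration; only the two class names come from the text above):

```scala
// Minimal sketch (not Spark's actual code) of the main-class choice
// for a Python application in client deploy mode.
def pythonMainClass(primaryResource: String): String =
  if (primaryResource == "pyspark-shell")
    "org.apache.spark.api.python.PythonGatewayServer"  // the PySpark shell
  else
    "org.apache.spark.deploy.PythonRunner"              // a regular .py application
```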

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment determines the cluster manager based on the master argument.

For KUBERNETES, prepareSubmitEnvironment uses checkAndGetK8sMasterUrl to validate and resolve the Kubernetes master URL from the master argument.
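The mapping from the master argument to a cluster manager follows the usual Spark master-URL prefixes. The helper below is an illustrative sketch only, not Spark's internal code (which uses integer constants and handles more cases):

```scala
// Illustrative sketch: resolve a cluster manager from the master argument
// using the well-known master URL prefixes.
def clusterManager(master: String): String =
  if (master.startsWith("k8s://")) "KUBERNETES"
  else if (master.startsWith("yarn")) "YARN"
  else if (master.startsWith("spark://")) "STANDALONE"
  else if (master.startsWith("mesos://")) "MESOS"
  else if (master.startsWith("local")) "LOCAL"
  else sys.error(s"Could not parse Master URL: '$master'")
```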

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            prepareSubmitEnvironment...FIXME

prepareSubmitEnvironment is used when...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#childMainClass","title":"childMainClass

childMainClass is the fourth and last element of the result tuple of prepareSubmitEnvironment.

// (childArgs, childClasspath, sparkConf, childMainClass)
(Seq[String], Seq[String], SparkConf, String)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            childMainClass can be as follows (based on the deployMode):

| Deploy Mode | Master URL | childMainClass |
|-------------|------------|----------------|
| client | any | mainClass |
| cluster | KUBERNETES | KubernetesClientApplication |
| cluster | MESOS | RestSubmissionClientApp (for REST submission API) |
| cluster | STANDALONE | RestSubmissionClientApp (for REST submission API) |
| cluster | STANDALONE | ClientApp |
| cluster | YARN | YarnClusterApplication |
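The same mapping can be sketched as a pattern match. The helper below is illustrative only: it uses the short class names from the table, whereas the real code uses fully-qualified names and additional switches (e.g. whether the REST submission API is used):

```scala
// Illustrative-only sketch of the childMainClass selection summarized above.
def childMainClassFor(
    deployMode: String,
    clusterManager: String,
    mainClass: String,
    useRest: Boolean): String =
  (deployMode, clusterManager) match {
    case ("client", _)                        => mainClass
    case ("cluster", "KUBERNETES")            => "KubernetesClientApplication"
    case ("cluster", "MESOS")                 => "RestSubmissionClientApp"
    case ("cluster", "STANDALONE") if useRest => "RestSubmissionClientApp"
    case ("cluster", "STANDALONE")            => "ClientApp"
    case ("cluster", "YARN")                  => "YarnClusterApplication"
    case other                                => sys.error(s"Unsupported combination: $other")
  }
```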

isKubernetesClient

prepareSubmitEnvironment uses the isKubernetesClient flag to indicate that (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Cluster manager is Kubernetes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Deploy mode is client
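Expressed as a condition, this is a hypothetical sketch (the string-typed clusterManager and deployMode parameters are assumptions for illustration, not Spark's internal variables):

```scala
// Hypothetical sketch of the isKubernetesClient condition.
def isKubernetesClient(clusterManager: String, deployMode: String): Boolean =
  clusterManager == "KUBERNETES" && deployMode == "client"
```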
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#iskubernetesclustermodedriver","title":"isKubernetesClusterModeDriver

prepareSubmitEnvironment uses the isKubernetesClusterModeDriver flag to indicate that (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isKubernetesClient
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • spark.kubernetes.submitInDriver configuration property is enabled (Spark on Kubernetes)
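Similarly, a hypothetical sketch of the combined condition (assuming a SparkConf in scope; spark.kubernetes.submitInDriver is the configuration property mentioned above):

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch of the isKubernetesClusterModeDriver condition.
def isKubernetesClusterModeDriver(
    isKubernetesClient: Boolean,
    sparkConf: SparkConf): Boolean =
  isKubernetesClient &&
    sparkConf.getBoolean("spark.kubernetes.submitInDriver", false)
```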
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#renameresourcestolocalfs","title":"renameResourcesToLocalFS
renameResourcesToLocalFS(
  resources: String,
  localResources: String): String

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            renameResourcesToLocalFS...FIXME

renameResourcesToLocalFS is used when isKubernetesClusterModeDriver is enabled.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#downloadresource","title":"downloadResource
downloadResource(
  resource: String): String

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            downloadResource...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#checking-whether-resource-is-internal","title":"Checking Whether Resource is Internal
isInternal(
  res: String): Boolean

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isInternal is true when the given res is spark-internal.
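In other words, a minimal sketch (the spark-internal literal is the special resource name quoted above):

```scala
// Minimal sketch: a resource is internal when it is the special
// "spark-internal" placeholder.
def isInternal(res: String): Boolean = res == "spark-internal"
```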

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isInternal is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmitArguments is requested to handleUnknown
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isuserjar","title":"isUserJar
isUserJar(
  res: String): Boolean

isUserJar is true when the given res is none of the following (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isShell
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isPython
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isInternal
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • isR
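A hypothetical sketch in terms of the other predicates, assuming isShell, isPython, isInternal and isR (all referenced above) are in scope:

```scala
// Hypothetical sketch: a user jar is anything not covered by the other
// predicates on this page (isShell, isPython, isInternal, isR).
def isUserJar(res: String): Boolean =
  !isShell(res) && !isPython(res) && !isInternal(res) && !isR(res)
```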

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isUserJar is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • FIXME
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmit/#isPython","title":"isPython
isPython(
  res: String): Boolean

isPython is positive (true) when the given res primary resource represents a PySpark application, i.e. one of the following (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • .py script
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • pyspark-shell
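Expressed directly, a minimal sketch based on the two cases above:

```scala
// Minimal sketch: a primary resource is a PySpark application when it is
// a .py script or the pyspark-shell placeholder.
def isPython(res: String): Boolean =
  res.endsWith(".py") || res == "pyspark-shell"
```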

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            isPython is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmit is requested to isUserJar
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SparkSubmitArguments is requested to handle an unknown option
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/","title":"SparkSubmitArguments","text":"

SparkSubmitArguments is created for SparkSubmit to parseArguments.

SparkSubmitArguments is a custom SparkSubmitArgumentsParser that handles the command-line arguments of the spark-submit script that the actions use for execution (possibly with the explicit env environment).

SparkSubmitArguments is created when launching the spark-submit script (with only args passed in) and is later used for printing out the arguments in verbose mode.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"tools/spark-submit/SparkSubmitArguments/#creating-instance","title":"Creating Instance","text":"

SparkSubmitArguments takes the following to be created (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Arguments (Seq[String])
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Environment Variables (default: sys.env)
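A hypothetical example of the two constructor inputs (the command line is made up; SparkSubmitArguments itself is internal to Spark, so only the shape of the inputs is shown):

```scala
// Illustrative values for the two constructor parameters:
// the command-line arguments as SparkSubmit would pass them, and the
// environment variables (which default to sys.env).
val arguments: Seq[String] =
  Seq("--master", "local[*]", "--class", "org.example.MyApp", "my-app.jar")
val environment: Map[String, String] = sys.env
```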

SparkSubmitArguments is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to parseArguments
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitArguments/#action","title":"Action","text":"
action: SparkSubmitAction

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              action is used by SparkSubmit to determine what to do when executed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              action can be one of the following SparkSubmitActions:

| Action | Description |
|--------|-------------|
| SUBMIT | The default action if none specified |
| KILL | Indicates --kill switch |
| REQUEST_STATUS | Indicates --status switch |
| PRINT_VERSION | Indicates --version switch |
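A hypothetical sketch of how the switches map to actions, mirroring the table above (not Spark's actual option parsing; string names stand in for the SparkSubmitAction values):

```scala
// Hypothetical sketch: derive the action from the command-line switches.
def actionFor(kill: Boolean, status: Boolean, printVersion: Boolean): String =
  if (kill) "KILL"
  else if (status) "REQUEST_STATUS"
  else if (printVersion) "PRINT_VERSION"
  else "SUBMIT"  // the default action if none specified
```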

action is undefined (null) by default (when SparkSubmitArguments is created).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              action is validated when validateArguments.
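For reference, SparkSubmitAction is a plain Scala Enumeration. The snippet below is a close paraphrase of its definition (access modifiers and packaging omitted), not a verbatim copy:

```scala
// Paraphrase of SparkSubmitAction: an enumeration of what spark-submit can be
// asked to do; SUBMIT is the fallback when no --kill, --status or --version
// switch is given.
object SparkSubmitAction extends Enumeration {
  type SparkSubmitAction = Value
  val SUBMIT, KILL, REQUEST_STATUS, PRINT_VERSION = Value
}
```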

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitArguments/#command-line-options","title":"Command-Line Options","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#-files","title":"--files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Configuration Property: spark.files
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Configuration Property (Spark on YARN): spark.yarn.dist.files

Printed out to standard output when the --verbose option is used.

When SparkSubmit is requested to prepareSubmitEnvironment, the files are processed with the following (a usage example follows the list):

• resolveGlobPaths
• downloadFileList
• renameResourcesToLocalFS
• downloadResource
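As an aside (not part of the original text of this page), files distributed with --files are downloaded to every node and can be resolved on executors with the SparkFiles utility; the file name data.csv below is only an example:

```scala
import org.apache.spark.SparkFiles

// Assuming the application was submitted with:
//   spark-submit --files /local/path/data.csv ...
// and a SparkContext is already running on this JVM.
val localPath = SparkFiles.get("data.csv")  // absolute path on this node
val firstLines = scala.io.Source.fromFile(localPath).getLines().take(5).toList
```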
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#loading-spark-properties","title":"Loading Spark Properties
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              loadEnvironmentArguments(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              loadEnvironmentArguments loads the Spark properties for the current execution of spark-submit.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              loadEnvironmentArguments reads command-line options first followed by Spark properties and System's environment variables.

Note

Spark configuration properties start with the spark. prefix and can be set using the --conf [key=value] command-line option.
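A minimal sketch of that precedence (the resolveSetting helper is illustrative only, not Spark's actual code): a command-line value wins over a Spark property, which wins over an environment variable.

```scala
// Hypothetical helper mirroring the documented order:
// command-line option > Spark property > environment variable.
def resolveSetting(
    cmdLineValue: Option[String],
    sparkProperties: Map[String, String],
    propertyKey: String,
    envVar: String): Option[String] =
  cmdLineValue
    .orElse(sparkProperties.get(propertyKey))
    .orElse(sys.env.get(envVar))

// e.g. resolving the master URL when no --master option was given
val master = resolveSetting(
  cmdLineValue = None,
  sparkProperties = Map("spark.master" -> "local[*]"),
  propertyKey = "spark.master",
  envVar = "MASTER")
// master == Some("local[*]")
```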

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#handle","title":"Option Handling SparkSubmitOptionParser
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              handle(\n  opt: String,\n  value: String): Boolean\n

handle is part of the SparkSubmitOptionParser abstraction.

handle parses the input opt argument and assigns the given value to the corresponding property.

In the end, handle returns true for every action but PRINT_VERSION.

User Option (opt)  Property
--kill             action
--name             name
--status           action
--version          action
...                ...
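As a rough illustration only (a simplified sketch, not Spark's actual handle implementation, and the Args class name is made up), option handling boils down to matching the option name, assigning the value to the right field, and reporting whether parsing should continue:

```scala
// Hypothetical, stripped-down version of the documented behavior: every
// recognized option updates a field; only --version stops further processing.
class Args {
  var action: String = "SUBMIT"
  var name: String = _

  def handle(opt: String, value: String): Boolean = {
    opt match {
      case "--kill"    => action = "KILL"
      case "--status"  => action = "REQUEST_STATUS"
      case "--version" => action = "PRINT_VERSION"
      case "--name"    => name = value
      case _           => // other known options elided
    }
    action != "PRINT_VERSION"
  }
}
```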
mergeDefaultSparkProperties

mergeDefaultSparkProperties(): Unit

mergeDefaultSparkProperties merges Spark properties from the default Spark properties file (spark-defaults.conf) with those specified through the --conf command-line option.
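A minimal sketch of the merge (the mergeDefaults helper is illustrative, not Spark's code), assuming that values given with --conf take precedence and spark-defaults.conf only fills in missing keys:

```scala
// Hypothetical helper mirroring the documented merge: --conf wins on conflicts,
// spark-defaults.conf contributes only the keys that were not set explicitly.
def mergeDefaults(
    confOptions: Map[String, String],   // from --conf key=value
    defaultsFile: Map[String, String]   // from spark-defaults.conf
  ): Map[String, String] =
  defaultsFile ++ confOptions           // right-hand operand wins on conflicts

val merged = mergeDefaults(
  confOptions  = Map("spark.executor.memory" -> "4g"),
  defaultsFile = Map("spark.executor.memory" -> "2g", "spark.master" -> "yarn"))
// merged("spark.executor.memory") == "4g"; merged("spark.master") == "yarn"
```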

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#isPython","title":"isPython
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isPython: Boolean = false\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isPython indicates whether the application resource is a PySpark application (a Python script or pyspark shell).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              isPython is isPython when SparkSubmitArguments is requested to handle a unknown option.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitArguments/#client-deploy-mode","title":"Client Deploy Mode

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              With isPython flag enabled, SparkSubmit determines the mainClass (and the childArgs) based on the primaryResource.

primaryResource   mainClass
pyspark-shell     org.apache.spark.api.python.PythonGatewayServer (PySpark)
anything else     org.apache.spark.deploy.PythonRunner (PySpark)

SparkSubmitCommandBuilder.OptionParser

SparkSubmitCommandBuilder.OptionParser is...FIXME

SparkSubmitCommandBuilder

SparkSubmitCommandBuilder is an AbstractCommandBuilder.

SparkSubmitCommandBuilder is used to build a command that spark-submit and SparkLauncher use to launch a Spark application.
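For context (not part of the original text), SparkLauncher is the public programmatic counterpart of spark-submit. A minimal example; the application path and main class below are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

// Launch a Spark application as a child process, roughly equivalent to
// spark-submit --class com.example.MyApp --master local[*] /path/to/app.jar
val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")      // placeholder path
  .setMainClass("com.example.MyApp")       // placeholder class
  .setMaster("local[*]")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()

val exitCode = process.waitFor()
```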

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkSubmitCommandBuilder uses the first argument to distinguish the shells:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1. pyspark-shell-main
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2. sparkr-shell-main
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              3. run-example

SparkSubmitCommandBuilder parses command-line arguments using OptionParser (which is a SparkSubmitOptionParser). OptionParser comes with the following methods (outlined in the sketch after this list):

1. handle to handle the known options (see the table below). It sets up master, deployMode, propertiesFile, conf, mainClass, sparkArgs internal properties.

2. handleUnknown to handle unrecognized options that usually lead to an Unrecognized option error message.

3. handleExtraArgs to handle extra arguments that are considered a Spark application's arguments.
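The three callbacks can be summarized as follows; this is only an outline paraphrasing the parser's protected methods (the original is Java, and the trait name SparkSubmitOptionHandling is made up for this sketch):

```scala
// Outline of the OptionParser callbacks described above (paraphrased).
trait SparkSubmitOptionHandling {
  // Called for every recognized option; returns false to stop parsing.
  def handle(opt: String, value: String): Boolean

  // Called for the first unrecognized option; returns false to stop parsing.
  def handleUnknown(opt: String): Boolean

  // Called with whatever is left, i.e. the application's own arguments.
  def handleExtraArgs(extra: java.util.List[String]): Unit
}
```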

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Note

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              For spark-shell it assumes that the application arguments are after spark-submit's arguments.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#pyspark-shell-main","title":"pyspark-shell-main Application Resource

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              When bin/pyspark shell script (and bin\\pyspark2.cmd) are launched, they use bin/spark-submit with pyspark-shell-main application resource as the first argument (followed by --name \"PySparkShell\" option among the others).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              pyspark-shell-main is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder is created and then requested to build a command (buildPySparkShellCommand actually)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildCommand","title":"Building Command AbstractCommandBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildCommand(\n  Map<String, String> env)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildCommand is part of the AbstractCommandBuilder abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildCommand branches off based on the application resource.

Application Resource                            Command Builder
pyspark-shell-main (but not isSpecialCommand)   buildPySparkShellCommand
sparkr-shell-main (but not isSpecialCommand)    buildSparkRCommand
anything else                                   buildSparkSubmitCommand

buildPySparkShellCommand

List<String> buildPySparkShellCommand(
  Map<String, String> env)

Note

appArgs is expected to be empty.

buildPySparkShellCommand makes sure that:

• There are no appArgs, or
• If there are appArgs, the first argument is not a Python script (a file with the .py extension)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildPySparkShellCommand sets the application resource as pyspark-shell.

Note

pyspark-shell-main is redefined to pyspark-shell.

buildPySparkShellCommand is executed when a command is requested with the pyspark-shell-main application resource, which is re-defined (reset) to pyspark-shell at this point.

buildPySparkShellCommand constructEnvVarArgs with the given env and the PYSPARK_SUBMIT_ARGS environment variable.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildPySparkShellCommand defines an internal pyargs collection for the parts of the shell command to execute.

buildPySparkShellCommand stores the Python executable (in pyargs), using the first one specified in the following order (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • spark.pyspark.driver.python configuration property
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • spark.pyspark.python configuration property
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • PYSPARK_DRIVER_PYTHON environment variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • PYSPARK_PYTHON environment variable
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • python3
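
The following is a minimal sketch of this first-specified resolution, assuming plain Map lookups for the configuration properties and environment variables (resolvePythonExecutable is an illustrative helper, not part of Spark):

```scala
// Illustrative sketch (not Spark's own code): pick the Python executable
// using the first value specified, in the documented order of precedence.
def resolvePythonExecutable(
    conf: Map[String, String],           // Spark configuration properties
    env: Map[String, String]): String = {
  Seq(
    conf.get("spark.pyspark.driver.python"),
    conf.get("spark.pyspark.python"),
    env.get("PYSPARK_DRIVER_PYTHON"),
    env.get("PYSPARK_PYTHON")
  ).flatten.headOption.getOrElse("python3")
}
```

With only spark.pyspark.python set to /usr/bin/python3.11, for example, the sketch returns /usr/bin/python3.11; with nothing set at all, it falls back to python3.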

buildPySparkShellCommand sets the following environment variables (for the Python executable to use), if specified:

• PYSPARK_PYTHON (from the spark.pyspark.python configuration property)
• SPARK_REMOTE (from the remote option or spark.remote)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In the end, buildPySparkShellCommand copies all the options from PYSPARK_DRIVER_PYTHON_OPTS, if specified.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildSparkSubmitCommand","title":"buildSparkSubmitCommand","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildSparkSubmitCommand(\n  Map<String, String> env)\n

buildSparkSubmitCommand starts by building the so-called effective config. When in client mode, buildSparkSubmitCommand adds spark.driver.extraClassPath to the resulting Spark command.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildSparkSubmitCommand builds the first part of the Java command passing in the extra classpath (only for client deploy mode).

isThriftServer case...FIXME

buildSparkSubmitCommand appends the SPARK_SUBMIT_OPTS and SPARK_JAVA_OPTS environment variables.

(only for client deploy mode) ...FIXME: Elaborate on the client deploy mode case

addPermGenSizeOpt case...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildSparkSubmitCommand appends org.apache.spark.deploy.SparkSubmit and the command-line arguments (using buildSparkSubmitArgs).
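
As a rough sketch of the resulting command layout under the steps above (skipping the FIXME cases; sparkSubmitCommandSketch and its parameters are illustrative stand-ins, not Spark's own API):

```scala
import scala.collection.mutable.ListBuffer

// Illustrative command layout:
//   java [driver extra classpath in client mode] [SPARK_SUBMIT_OPTS] [SPARK_JAVA_OPTS]
//        org.apache.spark.deploy.SparkSubmit <spark-submit arguments>
def sparkSubmitCommandSketch(
    env: Map[String, String],
    effectiveConfig: Map[String, String],
    clientMode: Boolean,
    sparkSubmitArgs: Seq[String]): Seq[String] = {
  val cmd = ListBuffer("java")
  if (clientMode) {
    // spark.driver.extraClassPath matters for the driver JVM in client deploy mode only
    effectiveConfig.get("spark.driver.extraClassPath").foreach(cp => cmd ++= Seq("-cp", cp))
  }
  // Extra JVM options taken from the environment, if present
  for (name <- Seq("SPARK_SUBMIT_OPTS", "SPARK_JAVA_OPTS"); opts <- env.get(name)) {
    cmd ++= opts.split("\\s+")
  }
  cmd += "org.apache.spark.deploy.SparkSubmit"
  cmd ++= sparkSubmitArgs
  cmd.toList
}
```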

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#buildsparksubmitargs","title":"buildSparkSubmitArgs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              List<String> buildSparkSubmitArgs()\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildSparkSubmitArgs builds a list of command-line arguments for spark-submit.

buildSparkSubmitArgs uses a SparkSubmitOptionParser to add the command-line arguments that spark-submit recognizes (spark-submit later parses them with the very same SparkSubmitOptionParser).
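
A simplified sketch of that idea, emitting the flags the parser recognizes from builder-style properties (sparkSubmitArgsSketch and its parameter list are illustrative, not Spark's actual code; see also the property-to-attribute mapping below):

```scala
// Illustrative sketch: build spark-submit arguments from builder properties.
def sparkSubmitArgsSketch(
    master: Option[String],
    deployMode: Option[String],
    mainClass: Option[String],
    conf: Map[String, String],
    appResource: Option[String],
    appArgs: Seq[String]): Seq[String] = {
  val args = Seq.newBuilder[String]
  master.foreach(m => args ++= Seq("--master", m))
  deployMode.foreach(d => args ++= Seq("--deploy-mode", d))
  mainClass.foreach(c => args ++= Seq("--class", c))
  conf.foreach { case (k, v) => args ++= Seq("--conf", s"$k=$v") }
  appResource.foreach(args += _)   // appResource is passed straight through
  args ++= appArgs                 // appArgs are passed straight through
  args.result()
}
```

For instance, Some("yarn") for master and Some("cluster") for deployMode produce --master yarn --deploy-mode cluster at the front of the argument list.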

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              buildSparkSubmitArgs is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • InProcessLauncher is requested to startApplication
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkLauncher is requested to createBuilder
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder is requested to buildSparkSubmitCommand and constructEnvVarArgs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitCommandBuilder/#sparksubmitcommandbuilder-properties-and-sparksubmitoptionparser-attributes","title":"SparkSubmitCommandBuilder Properties and SparkSubmitOptionParser Attributes SparkSubmitCommandBuilder Property SparkSubmitOptionParser Attribute verbose VERBOSE master MASTER [master] deployMode DEPLOY_MODE [deployMode] appName NAME [appName] conf CONF [key=value]* propertiesFile PROPERTIES_FILE [propertiesFile] jars JARS [comma-separated jars] files FILES [comma-separated files] pyFiles PY_FILES [comma-separated pyFiles] mainClass CLASS [mainClass] sparkArgs sparkArgs (passed straight through) appResource appResource (passed straight through) appArgs appArgs (passed straight through)","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/","title":"SparkSubmitOperation","text":"

SparkSubmitOperation is an abstraction of the operations of spark-submit (killing a submission and requesting a submission status). A sketch of a custom implementation follows the Implementations list below.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitOperation/#contract","title":"Contract","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#killing-submission","title":"Killing Submission
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              kill(\n  submissionId: String,\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Kills a given submission

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to kill a submission
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#displaying-submission-status","title":"Displaying Submission Status
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              printSubmissionStatus(\n  submissionId: String,\n  conf: SparkConf): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Displays status of a given submission

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested for submission status
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#checking-whether-master-url-supported","title":"Checking Whether Master URL Supported
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              supports(\n  master: String): Boolean\n

Checks whether a given master URL is supported

Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to kill a submission and for a submission status (via getSubmitOperations utility)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitOperation/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • K8SSparkSubmitOperation (Spark on Kubernetes)
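
A minimal sketch of what a custom implementation could look like, for a hypothetical myproto:// cluster manager (MyProtoSubmitOperation is made up for illustration; the sketch assumes the trait is not public and so places the class under the org.apache.spark package, as K8SSparkSubmitOperation is):

```scala
// Hypothetical SparkSubmitOperation for a made-up myproto:// cluster manager.
package org.apache.spark.deploy.myproto

import org.apache.spark.SparkConf
import org.apache.spark.deploy.SparkSubmitOperation

class MyProtoSubmitOperation extends SparkSubmitOperation {

  // A real implementation would call the cluster manager's kill endpoint.
  override def kill(submissionId: String, conf: SparkConf): Unit =
    println(s"Killing submission $submissionId")

  // A real implementation would query the cluster manager and print the status.
  override def printSubmissionStatus(submissionId: String, conf: SparkConf): Unit =
    println(s"Status of submission $submissionId: UNKNOWN")

  // Claim only master URLs with the made-up myproto:// scheme.
  override def supports(master: String): Boolean =
    master.startsWith("myproto://")
}
```
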
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitOptionParser/","title":"SparkSubmitOptionParser","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkSubmitOptionParser is the parser of spark-submit's command-line options.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#parse","title":"Parsing Arguments","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              void parse(\n  List<String> args)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              parse...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              parse is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • AbstractLauncher is requested to addSparkArg
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Main is launched
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmitCommandBuilder is created and requested to buildSparkSubmitArgs
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#handle","title":"Option Handling","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              boolean handle(\n  String opt,\n  String value)\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              handle throws an UnsupportedOperationException (and expects subclasses to override the default behaviour, e.g. SparkSubmitArguments).
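
A minimal sketch of such a subclass that simply records the recognized options (RecordingOptionParser is illustrative; the sketch assumes it is compiled in the org.apache.spark.launcher package since the parser class is not public):

```scala
// Illustrative subclass that records every option the parser recognizes.
package org.apache.spark.launcher

import java.util.{Arrays, List => JList}
import scala.collection.mutable

class RecordingOptionParser extends SparkSubmitOptionParser {
  val recognized = mutable.LinkedHashMap.empty[String, String]

  // Invoked for every recognized option; returning true keeps parsing.
  override protected def handle(opt: String, value: String): Boolean = {
    recognized(opt) = value
    true
  }

  // Invoked for the first argument that is not a recognized option.
  override protected def handleUnknown(opt: String): Boolean = false

  // Invoked with all arguments following the first unrecognized one.
  override protected def handleExtraArgs(extra: JList[String]): Unit = ()

  // Convenience wrapper around the protected parse method.
  def parseArgs(args: String*): this.type = {
    parse(Arrays.asList(args: _*))
    this
  }
}
```

new RecordingOptionParser().parseArgs("--master", "local[*]", "--class", "MyApp").recognized would then map --master to local[*] and --class to MyApp.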

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitOptionParser/#-files","title":"--files

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              A comma-separated sequence of paths

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"tools/spark-submit/SparkSubmitUtils/","title":"SparkSubmitUtils","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              SparkSubmitUtils provides utilities for SparkSubmit.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"tools/spark-submit/SparkSubmitUtils/#getsubmitoperations","title":"getSubmitOperations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getSubmitOperations(\n  master: String): SparkSubmitOperation\n

getSubmitOperations...FIXME (a hedged sketch follows the list below)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              getSubmitOperations\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • SparkSubmit is requested to kill a submission and requestStatus
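
The resolution logic is still marked FIXME above. As a purely illustrative sketch (not Spark's actual code), one way such a lookup could work is a ServiceLoader-style discovery of SparkSubmitOperation implementations filtered by the master URL; the trait below and its methods are assumptions made only for this example.

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

// Illustrative stand-in for the SparkSubmitOperation contract (assumed shape).
trait SparkSubmitOperation {
  def supports(master: String): Boolean
  def kill(submissionId: String): Unit
  def printSubmissionStatus(submissionId: String): Unit
}

object SubmitOperationLookup {
  // Sketch: discover implementations on the classpath and pick the single one
  // that declares support for the given master URL.
  def getSubmitOperations(master: String): SparkSubmitOperation = {
    val candidates = ServiceLoader
      .load(classOf[SparkSubmitOperation])
      .asScala
      .filter(_.supports(master))
      .toSeq
    candidates match {
      case Seq(op) => op
      case Seq()   => sys.error(s"No SparkSubmitOperation supports master URL: $master")
      case many    => sys.error(s"Multiple SparkSubmitOperations support $master: $many")
    }
  }
}
```

Treat this only as a mental model for "map a master URL to a submit operation"; the real SparkSubmit code may resolve the operation differently.
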
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/","title":"Web UIs","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              web UI is the web interface of Spark applications or infrastructure for monitoring and inspection.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The main abstraction is WebUI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/AllJobsPage/","title":"AllJobsPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              AllJobsPage is a WebUIPage of JobsTab.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/AllJobsPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              AllJobsPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Parent JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • AppStatusStore"},{"location":"webui/AllJobsPage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                render renders a Spark Jobs page with the jobs and executors alongside applicationInfo and appSummary (from the AppStatusStore).
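
As a rough mental model of that flow (hypothetical Demo* types, not Spark's classes; assumes the Servlet API and scala-xml are on the classpath), a jobs page pulls application info and a summary from a status store and turns them into HTML nodes:

```scala
import javax.servlet.http.HttpServletRequest
import scala.xml.Node

// Hypothetical minimal store types used only for this sketch.
case class DemoAppInfo(name: String)
case class DemoAppSummary(numActiveJobs: Int, numCompletedJobs: Int)

class DemoStatusStore {
  def applicationInfo(): DemoAppInfo = DemoAppInfo("demo-app")
  def appSummary(): DemoAppSummary = DemoAppSummary(numActiveJobs = 0, numCompletedJobs = 0)
}

// A simplified page: read from the store, render a Seq[Node] for the UI.
class DemoJobsPage(store: DemoStatusStore) {
  def render(request: HttpServletRequest): Seq[Node] = {
    val appInfo = store.applicationInfo()
    val summary = store.appSummary()
    <div>
      <h4>Spark Jobs for {appInfo.name}</h4>
      <p>Active jobs: {summary.numActiveJobs}, completed jobs: {summary.numCompletedJobs}</p>
    </div>
  }
}
```
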

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"webui/AllJobsPage/#introduction","title":"Introduction

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                AllJobsPage renders a summary, an event timeline, and active, completed, and failed jobs of a Spark application.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                AllJobsPage displays the Summary section with the current Spark user, total uptime, scheduling mode, and the number of jobs per status.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Under the summary section is the Event Timeline section.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Active Jobs, Completed Jobs, and Failed Jobs sections follow.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Jobs are clickable (and give information about the stages of tasks inside it).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                When you hover over a job in Event Timeline not only you see the job legend but also the job is highlighted in the Summary section.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The Event Timeline section shows not only jobs but also executors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ","text":""},{"location":"webui/AllStagesPage/","title":"AllStagesPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                AllStagesPage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/AllStagesPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                AllStagesPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Parent StagesTab"},{"location":"webui/AllStagesPage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  render renders a Stages for All Jobs page with the stages and application summary (from the AppStatusStore of the parent StagesTab).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/AllStagesPage/#stage-headers","title":"Stage Headers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  AllStagesPage uses the following headers and tooltips for the Stages table.

| Header | Tooltip |
|--------|---------|
| Stage Id | |
| Pool Name | |
| Description | |
| Submitted | |
| Duration | Elapsed time since the stage was submitted until execution completion of all its tasks. |
| Tasks: Succeeded/Total | |
| Input | Bytes read from Hadoop or from Spark storage. |
| Output | Bytes written to Hadoop. |
| Shuffle Read | Total shuffle bytes and records read (includes both data read locally and data read from remote executors). |
| Shuffle Write | Bytes and records written to disk in order to be read by a shuffle in a future stage. |
| Failure Reason | |

# EnvironmentPage

## Review Me

[[prefix]] EnvironmentPage is a spark-webui-WebUIPage.md[WebUIPage] with an empty spark-webui-WebUIPage.md#prefix[prefix].

EnvironmentPage is <<creating-instance, created>> exclusively when EnvironmentTab is spark-webui-EnvironmentTab.md#creating-instance[created].

== [[creating-instance]] Creating EnvironmentPage Instance

EnvironmentPage takes the following when created:

* [[parent]] Parent spark-webui-EnvironmentTab.md[EnvironmentTab]
* [[conf]] SparkConf.md[SparkConf]
* [[store]] core:AppStatusStore.md[]

# EnvironmentTab

## Review Me

[[prefix]] EnvironmentTab is a spark-webui-SparkUITab.md[SparkUITab] with environment spark-webui-SparkUITab.md#prefix[prefix].

EnvironmentTab is <<creating-instance, created>> exclusively when SparkUI is spark-webui-SparkUI.md#initialize[initialized].

[[creating-instance]] EnvironmentTab takes the following when created:

* [[parent]] Parent spark-webui-SparkUI.md[SparkUI]
* [[store]] core:AppStatusStore.md[]

When created, EnvironmentTab creates the spark-webui-EnvironmentPage.md#creating-instance[EnvironmentPage] page and spark-webui-WebUITab.md#attachPage[attaches] it immediately.
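
Schematically, this create-and-attach-immediately pattern looks like the sketch below (hypothetical Demo* types; the real SparkUI/WebUITab API carries more state):

```scala
// Hypothetical stand-ins used only to illustrate the pattern.
class DemoSparkUI {
  private var pages = List.empty[AnyRef]
  def attachPage(page: AnyRef): Unit = pages = page :: pages
}

class DemoEnvironmentPage(parent: DemoEnvironmentTab)

class DemoEnvironmentTab(parent: DemoSparkUI) {
  // The page is created and attached as part of the tab's construction,
  // so it is available as soon as the tab itself exists.
  private val page = new DemoEnvironmentPage(this)
  parent.attachPage(page)
}
```
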

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/ExecutorThreadDumpPage/","title":"ExecutorThreadDumpPage","text":""},{"location":"webui/ExecutorThreadDumpPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[prefix]] ExecutorThreadDumpPage is a spark-webui-WebUIPage.md[WebUIPage] with threadDump spark-webui-WebUIPage.md#prefix[prefix].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorThreadDumpPage is <> exclusively when ExecutorsTab is spark-webui-ExecutorsTab.md#creating-instance[created] (with spark.ui.threadDumpsEnabled configuration property enabled).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: spark.ui.threadDumpsEnabled configuration property is enabled (i.e. true) by default.
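
For example, thread dumps can be switched off explicitly through SparkConf (a minimal sketch using the property named in the note above):

```scala
import org.apache.spark.SparkConf

// spark.ui.threadDumpsEnabled is true by default; setting it to false hides
// the thread-dump page in the web UI's Executors tab.
val conf = new SparkConf()
  .setAppName("demo")
  .set("spark.ui.threadDumpsEnabled", "false")
```
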

=== [[creating-instance]] Creating ExecutorThreadDumpPage Instance

ExecutorThreadDumpPage takes the following when created:

* [[parent]] spark-webui-SparkUITab.md[SparkUITab]
* [[sc]] Optional SparkContext.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/ExecutorsPage/","title":"ExecutorsPage","text":""},{"location":"webui/ExecutorsPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[prefix]] ExecutorsPage is a spark-webui-WebUIPage.md[WebUIPage] with an empty spark-webui-WebUIPage.md#prefix[prefix].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorsPage is <> exclusively when ExecutorsTab is spark-webui-ExecutorsTab.md#creating-instance[created].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[creating-instance]] Creating ExecutorsPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ExecutorsPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[parent]] Parent spark-webui-SparkUITab.md[SparkUITab]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[threadDumpEnabled]] threadDumpEnabled flag
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/ExecutorsTab/","title":"ExecutorsTab","text":""},{"location":"webui/ExecutorsTab/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[prefix]] ExecutorsTab is a spark-webui-SparkUITab.md[SparkUITab] with executors spark-webui-SparkUITab.md#prefix[prefix].

ExecutorsTab is <<creating-instance, created>> exclusively when SparkUI is spark-webui-SparkUI.md#initialize[initialized].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[creating-instance]] [[parent]] ExecutorsTab takes the parent spark-webui-SparkUI.md[SparkUI] when created.

When <<creating-instance, created>>, ExecutorsTab creates the following pages and spark-webui-WebUITab.md#attachPage[attaches] them immediately (see the sketch after the list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark-webui-ExecutorsPage.md[ExecutorsPage]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • spark-webui-ExecutorThreadDumpPage.md[ExecutorThreadDumpPage]
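
The following is a minimal sketch (not the actual Spark sources) of how a tab like ExecutorsTab can attach its pages while being created; the page constructors and the sketch's class name are simplified assumptions made for illustration.

[source, scala]
----
import org.apache.spark.ui.{SparkUI, SparkUITab}

// A sketch only: the page constructors below are simplified assumptions,
// not the exact signatures used in the Spark codebase.
class ExecutorsTabSketch(parent: SparkUI)
  extends SparkUITab(parent, \"executors\") {

  // Pages are attached immediately, while the tab is being created
  attachPage(new ExecutorsPage(this))
  attachPage(new ExecutorThreadDumpPage(this, parent.sc))
}
----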

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/","title":"JettyUtils","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  == [[JettyUtils]] JettyUtils

JettyUtils is a set of <<utility-methods, utility methods>> for creating Jetty HTTP Server-specific components.

[[utility-methods]]
.JettyUtils's Utility Methods
[cols=\"1,2\",options=\"header\",width=\"100%\"]
|===
| Name | Description

| <<createServlet, createServlet>> | Creates an HttpServlet

| <<createStaticHandler, createStaticHandler>> | Creates a Handler for static content

| <<createServletHandler, createServletHandler>> | Creates a ServletContextHandler for a path
|===

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[createServletHandler]] Creating ServletContextHandler for Path -- createServletHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#source-scala","title":"[source, scala]","text":"

createServletHandler(
  path: String,
  servlet: HttpServlet,
  basePath: String): ServletContextHandler

createServletHandler[T <: AnyRef](...): ServletContextHandler // <1>

<1> Uses the first, three-argument createServletHandler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createServletHandler...FIXME
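
As a usage sketch (hypothetical path and servlet, not taken from the Spark sources), the three-argument variant mounts a ready-made HttpServlet under a request path and returns a ServletContextHandler that can be attached to the web UI's server:

[source, scala]
----
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.servlet.ServletContextHandler
import org.apache.spark.ui.JettyUtils

// A hypothetical servlet that answers GET [basePath]/ping with \"pong\"
val pingServlet = new HttpServlet {
  override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    res.setContentType(\"text/plain;charset=utf-8\")
    res.getWriter.print(\"pong\")
  }
}

// JettyUtils is Spark-internal (private[spark]), so this only illustrates the call shape
val handler: ServletContextHandler =
  JettyUtils.createServletHandler(\"/ping\", pingServlet, basePath = \"\")
----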

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#note","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createServletHandler is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • WebUI is requested to spark-webui-WebUI.md#attachPage[attachPage]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • MetricsServlet is requested to getHandlers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#spark-standalones-workerwebui-is-requested-to-initialize","title":"* Spark Standalone's WorkerWebUI is requested to initialize","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[createServlet]] Creating HttpServlet -- createServlet Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#source-scala_1","title":"[source, scala]","text":"

createServlet[T <: AnyRef](...): HttpServlet

createServlet determines the value of the X-Frame-Options header: either ALLOW-FROM with the value of the spark-webui-properties.md#spark.ui.allowFramingFrom[spark.ui.allowFramingFrom] configuration property (if defined) or SAMEORIGIN.

createServlet creates a Java Servlet HttpServlet with support for GET requests.

When handling GET requests, the HttpServlet first checks the view permissions of the remote user (by requesting the SecurityManager to checkUIViewPermissions).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#tip","title":"[TIP]","text":"

Enable DEBUG logging level for the org.apache.spark.SecurityManager logger to see what happens when SecurityManager performs the security check.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  log4j.logger.org.apache.spark.SecurityManager=DEBUG\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  You should see the following DEBUG message in the logs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#debug-securitymanager-useruser-aclsenabledaclsenabled-viewaclsviewacls-viewaclsgroupsviewaclsgroups","title":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DEBUG SecurityManager: user=[user] aclsEnabled=[aclsEnabled] viewAcls=[viewAcls] viewAclsGroups=[viewAclsGroups]\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":"

When the view permissions check passes, the HttpServlet sends a response with the following (see the sketch at the end of this section):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • FIXME

When the view permissions do not allow viewing the page, the HttpServlet sends an error response with the following:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Status 403

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Cache-Control header with \"no-cache, no-store, must-revalidate\"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Error message: \"User is not authorized to access this page.\"

NOTE: createServlet is used exclusively when JettyUtils is requested to <<createServletHandler, create a ServletContextHandler for a path>>.
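
The request handling described above can be summarized with the following sketch (not the actual Spark code); the sketchServlet name and the renderPage callback are made up for illustration, while checkUIViewPermissions and spark.ui.allowFramingFrom come from the description above (SecurityManager is Spark-internal).

[source, scala]
----
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.apache.spark.{SecurityManager, SparkConf}

// A sketch of the behaviour described above, not a copy of the Spark sources
def sketchServlet(
    securityMgr: SecurityManager,
    conf: SparkConf)(renderPage: HttpServletRequest => String): HttpServlet = {

  // ALLOW-FROM [uri] when spark.ui.allowFramingFrom is defined, SAMEORIGIN otherwise
  val xFrameOptions = conf.getOption(\"spark.ui.allowFramingFrom\")
    .map(uri => s\"ALLOW-FROM $uri\")
    .getOrElse(\"SAMEORIGIN\")

  new HttpServlet {
    override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
      if (securityMgr.checkUIViewPermissions(req.getRemoteUser)) {
        res.setHeader(\"X-Frame-Options\", xFrameOptions)
        res.setStatus(HttpServletResponse.SC_OK)
        res.getWriter.print(renderPage(req))
      } else {
        // Status 403, no caching, and the error message from the description above
        res.setHeader(\"Cache-Control\", \"no-cache, no-store, must-revalidate\")
        res.sendError(HttpServletResponse.SC_FORBIDDEN,
          \"User is not authorized to access this page.\")
      }
    }
  }
}
----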

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[createStaticHandler]] Creating Handler For Static Content -- createStaticHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#source-scala_2","title":"[source, scala]","text":""},{"location":"webui/JettyUtils/#createstatichandlerresourcebase-string-path-string-servletcontexthandler","title":"createStaticHandler(resourceBase: String, path: String): ServletContextHandler","text":"

createStaticHandler creates a handler for serving files from a static directory (see the sketch at the end of this section).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Internally, createStaticHandler creates a Jetty ServletContextHandler and sets org.eclipse.jetty.servlet.Default.gzip init parameter to false.

createStaticHandler creates a Jetty https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/servlet/DefaultServlet.html[DefaultServlet].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#note_1","title":"[NOTE]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Quoting the official documentation of Jetty's https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/servlet/DefaultServlet.html[DefaultServlet]:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DefaultServlet The default servlet. This servlet, normally mapped to /, provides the handling for static content, OPTION and TRACE methods for the context. The following initParameters are supported, these can be set either on the servlet itself or as ServletContext initParameters with a prefix of org.eclipse.jetty.servlet.Default.

With that, org.eclipse.jetty.servlet.Default.gzip configures the https://www.eclipse.org/jetty/documentation/current/advanced-extras.html#default-servlet-init[gzip] init parameter of Jetty's DefaultServlet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  gzip If set to true, then static content will be served as gzip content encoded if a matching resource is found ending with \".gz\" (default false) (deprecated: use precompressed)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ====

createStaticHandler resolves the resourceBase in the Spark classloader and, if successful, sets the resourceBase init parameter of the Jetty DefaultServlet to the resolved URL.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: https://www.eclipse.org/jetty/documentation/current/advanced-extras.html#default-servlet-init[resourceBase] init parameter is used to replace the context resource base.

createStaticHandler requests the ServletContextHandler to use the path as the context path and registers the DefaultServlet to serve it.

createStaticHandler throws an Exception if the input resourceBase could not be resolved:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Could not find resource path for Web UI: [resourceBase]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: createStaticHandler is used when spark-webui-SparkUI.md#initialize[SparkUI], spark-history-server:HistoryServer.md#initialize[HistoryServer], Spark Standalone's MasterWebUI and WorkerWebUI, Spark on Mesos' MesosClusterUI are requested to initialize.
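
The steps above roughly translate to the following Jetty-level sketch (a simplified reading of the description, not a verbatim copy of the Spark sources); resolving the resource on the current thread's context classloader is an assumption made for the example.

[source, scala]
----
import org.eclipse.jetty.servlet.{DefaultServlet, ServletContextHandler, ServletHolder}

// A sketch of createStaticHandler as described above
def staticHandlerSketch(resourceBase: String, path: String): ServletContextHandler = {
  val contextHandler = new ServletContextHandler()
  // Disable gzip encoding of the static content served by DefaultServlet
  contextHandler.setInitParameter(\"org.eclipse.jetty.servlet.Default.gzip\", \"false\")

  val holder = new ServletHolder(new DefaultServlet())
  // Assumption for this sketch: resolve resourceBase on the context classloader
  Option(Thread.currentThread().getContextClassLoader.getResource(resourceBase)) match {
    case Some(res) => holder.setInitParameter(\"resourceBase\", res.toString)
    case None =>
      throw new Exception(s\"Could not find resource path for Web UI: $resourceBase\")
  }

  // Serve the resolved static content under the given context path
  contextHandler.setContextPath(path)
  contextHandler.addServlet(holder, \"/\")
  contextHandler
}
----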

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[createRedirectHandler]] createRedirectHandler Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JettyUtils/#source-scala_3","title":"[source, scala]","text":"

createRedirectHandler(
  srcPath: String,
  destPath: String,
  beforeRedirect: HttpServletRequest => Unit = x => (),
  basePath: String = \"\",
  httpMethods: Set[String] = Set(\"GET\")): ServletContextHandler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  createRedirectHandler...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  NOTE: createRedirectHandler is used when spark-webui-SparkUI.md#initialize[SparkUI] and Spark Standalone's MasterWebUI are requested to initialize.
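Although the method body is not covered here, the signature above already hints at what the resulting handler does. The following is a minimal, self-contained sketch (not Spark's implementation): a servlet that runs beforeRedirect and then redirects requests to destPath for the allowed HTTP methods.

[source, scala]
----
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Hedged sketch, not Spark's code: a redirect servlet shaped after the
// createRedirectHandler signature above.
class RedirectServlet(
    destPath: String,
    beforeRedirect: HttpServletRequest => Unit = _ => (),
    httpMethods: Set[String] = Set("GET")) extends HttpServlet {

  override def service(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    if (httpMethods.contains(req.getMethod)) {
      beforeRedirect(req)          // e.g. a callback to run before redirecting
      res.sendRedirect(destPath)   // send the client to the destination path
    } else {
      res.sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED)
    }
  }
}
----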

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JobPage/","title":"JobPage","text":""},{"location":"webui/JobPage/#review-me","title":"Review Me","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [[prefix]] JobPage is a spark-webui-WebUIPage.md[WebUIPage] with job spark-webui-WebUIPage.md#prefix[prefix].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  JobPage is <> exclusively when JobsTab is created.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  === [[creating-instance]] Creating JobPage Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  JobPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[parent]] Parent JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • [[store]] core:AppStatusStore.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JobsTab/","title":"JobsTab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  JobsTab is a SparkUITab with jobs URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/JobsTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  JobsTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusStore

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    JobsTab is created\u00a0when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/JobsTab/#pages","title":"Pages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, JobsTab attaches the following pages (with a reference to itself and the AppStatusStore):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • AllJobsPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • JobPage
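A minimal sketch of that wiring, assuming the constructor arguments follow the list above (this is an illustration, not a verbatim copy of JobsTab):

[source, scala]
----
// Hedged sketch of JobsTab's constructor body: attachPage comes from
// WebUITab; the exact constructor arguments are assumed from the list above.
attachPage(new AllJobsPage(this, store))  // landing page listing all jobs
attachPage(new JobPage(this, store))      // details of a single job
----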
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/JobsTab/#event-timeline","title":"Event Timeline","text":""},{"location":"webui/JobsTab/#details-for-job","title":"Details for Job","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Clicking a job in AllJobsPage, leads to Details for Job page.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When a job id is not found, you should see \"No information to display for job ID\" message.
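A minimal sketch of that behaviour, assuming the job ID arrives as an id request parameter and that the lookup function is supplied by the caller (both the parameter name and the helper are assumptions for illustration):

[source, scala]
----
import javax.servlet.http.HttpServletRequest

// Hedged sketch: the "id" parameter name and the lookup function are
// assumptions; the fallback message mirrors the one described above.
def jobDetailsOrMessage(
    request: HttpServletRequest,
    lookupJob: Int => Option[String]): String = {
  val jobId = Option(request.getParameter("id")).map(_.trim.toInt)
  jobId.flatMap(lookupJob).getOrElse(
    s"No information to display for job ${jobId.getOrElse("?")}")
}
----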

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/PoolPage/","title":"PoolPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    PoolPage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/PoolPage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    PoolPage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Parent StagesTab"},{"location":"webui/PoolPage/#url-prefix","title":"URL Prefix

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      PoolPage uses pool URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/PoolPage/#rendering-page","title":"Rendering Page
render(
  request: HttpServletRequest): Seq[Node]

render is part of the WebUIPage abstraction.

render requires poolname and attempt request parameters.

render renders a Fair Scheduler Pool page with the PoolData (from the AppStatusStore of the parent StagesTab).
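A minimal sketch of that parameter handling, assuming a hypothetical lookup of the pool details (the helper names below are made up for illustration):

[source, scala]
----
import javax.servlet.http.HttpServletRequest

// Hedged sketch of the render prerequisites: both request parameters must be
// present before the (hypothetical) pool lookup can run.
def requiredParameter(request: HttpServletRequest, name: String): String =
  Option(request.getParameter(name)).filter(_.nonEmpty).getOrElse(
    throw new IllegalArgumentException(s"Missing $name parameter"))

def renderSketch(
    request: HttpServletRequest,
    lookupPoolData: String => Option[String]): String = {
  val poolName = requiredParameter(request, "poolname")
  val attempt  = requiredParameter(request, "attempt")
  lookupPoolData(poolName).getOrElse(
    s"Unknown pool: $poolName (attempt $attempt)")
}
----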

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/PoolPage/#introduction","title":"Introduction

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The Fair Scheduler Pool Details page shows information about a Schedulable pool and is only available when a Spark application uses the FAIR scheduling mode.
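For example (a self-contained sketch; the application, master, and pool names are made up), FAIR scheduling is enabled with the spark.scheduler.mode property, and jobs can then be routed to a pool via the spark.scheduler.pool local property:

[source, scala]
----
import org.apache.spark.{SparkConf, SparkContext}

// Enable FAIR scheduling so the Fair Scheduler Pool Details page is available.
val conf = new SparkConf()
  .setAppName("fair-pools-demo")   // made-up application name
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")
val sc = SparkContext.getOrCreate(conf)

// Jobs submitted from this thread go to the "production" pool (assumed name).
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 100).count()
----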

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/PoolPage/#summary-table","title":"Summary Table","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The Summary table shows the details of a Schedulable pool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      It uses the following columns:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Pool Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Minimum Share
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Pool Weight
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Active Stages (the number of the active stages in a Schedulable pool)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Running Tasks
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SchedulingMode
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/PoolPage/#active-stages-table","title":"Active Stages Table","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The Active Stages table shows the active stages in a pool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/PrometheusResource/","title":"PrometheusResource","text":""},{"location":"webui/PrometheusResource/#getservlethandler","title":"getServletHandler
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getServletHandler(\n  uiRoot: UIRoot): ServletContextHandler\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getServletHandler...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      getServletHandler\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkUI is requested to initialize
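SparkUI is generally understood to attach this handler only when the (experimental) spark.ui.prometheus.enabled property is enabled; treat the snippet below as a hedged sketch rather than a definitive reference (the endpoint path in the comment is an assumption):

[source, scala]
----
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: enabling the experimental Prometheus servlet in the web UI.
val conf = new SparkConf()
  .setAppName("prometheus-ui-demo")   // made-up application name
  .setMaster("local[*]")
  .set("spark.ui.prometheus.enabled", "true")
val sc = SparkContext.getOrCreate(conf)
// Executor metrics are then expected under the application UI, e.g.
// http://localhost:4040/metrics/executors/prometheus (path assumed).
----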
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/RDDPage/","title":"RDDPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      == [[RDDPage]] RDDPage

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [[prefix]] RDDPage is a spark-webui-WebUIPage.md[WebUIPage] with rdd spark-webui-WebUIPage.md#prefix[prefix].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      RDDPage is <> exclusively when StorageTab is spark-webui-StorageTab.md#creating-instance[created].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [[creating-instance]] RDDPage takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • [[parent]] Parent spark-webui-SparkUITab.md[SparkUITab]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • [[store]] core:AppStatusStore.md[]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      === [[render]] render Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/RDDPage/#source-scala","title":"[source, scala]","text":""},{"location":"webui/RDDPage/#renderrequest-httpservletrequest-seqnode","title":"render(request: HttpServletRequest): Seq[Node]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NOTE: render is part of spark-webui-WebUIPage.md#render[WebUIPage Contract] to...FIXME.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      render...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/SparkUI/","title":"SparkUI","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      SparkUI is a WebUI of Spark applications.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/SparkUI/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      SparkUI takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AppStatusStore
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkContext
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SparkConf
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Application Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Base Path
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Start Time
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Spark Version

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        While being created, SparkUI initializes itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        SparkUI is created using create utility.
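For a live application you do not call create yourself; the following is a minimal, hedged sketch of checking from user code that the UI came up (a local master is assumed purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// With spark.ui.enabled left at its default (true), creating a SparkContext
// creates and starts a SparkUI behind the scenes.
val conf = new SparkConf().setMaster("local[*]").setAppName("sparkui-demo")
val sc = SparkContext.getOrCreate(conf)

// uiWebUrl is defined only when the web UI is enabled and bound.
sc.uiWebUrl.foreach(url => println(s"Spark web UI available at $url"))

sc.stop()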

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/SparkUI/#ui-port","title":"UI Port
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getUIPort(\n  conf: SparkConf): Int\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        getUIPort requests the SparkConf for the value of spark.ui.port configuration property.

getUIPort is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUI is created
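As a hedged usage sketch (not taken from this page), spark.ui.port can be set explicitly before the SparkContext, and hence the SparkUI, is created:

import org.apache.spark.SparkConf

// Sketch: overriding spark.ui.port (4040 by default).
// If the chosen port is already in use, Spark retries subsequent ports
// (subject to spark.port.maxRetries).
val conf = new SparkConf()
  .setAppName("custom-ui-port")
  .set("spark.ui.port", "4050")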
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/SparkUI/#creating-sparkui","title":"Creating SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        create(\n  sc: Option[SparkContext],\n  store: AppStatusStore,\n  conf: SparkConf,\n  securityManager: SecurityManager,\n  appName: String,\n  basePath: String,\n  startTime: Long,\n  appSparkVersion: String): SparkUI\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        create creates a new SparkUI with appSparkVersion being the current Spark version.

create is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkContext is created (with the spark.ui.enabled configuration property turned on)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • FsHistoryProvider (Spark History Server) is requested for the web UI of a Spark application
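In both cases create is called with the same shape of arguments. A simplified, illustrative sketch follows; create is internal API, so this only illustrates the parameters (sparkContext, store, sparkConf and securityManager are assumed to exist in the calling context and are not values from this page):

import org.apache.spark.SPARK_VERSION
import org.apache.spark.ui.SparkUI

val ui: SparkUI = SparkUI.create(
  sc = Some(sparkContext),           // None in the History Server
  store = store,                     // an AppStatusStore
  conf = sparkConf,
  securityManager = securityManager,
  appName = sparkConf.get("spark.app.name"),
  basePath = "",                     // live applications use an empty base path
  startTime = System.currentTimeMillis(),
  appSparkVersion = SPARK_VERSION)   // the current Spark version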
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/SparkUI/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        initialize(): Unit\n

initialize is part of the WebUI abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        initialize creates and attaches the following tabs:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        2. StagesTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        3. StorageTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4. EnvironmentTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        5. ExecutorsTab

initialize attaches this SparkUI itself as the UIRoot (for the status REST API).

initialize attaches the PrometheusResource for executor metrics when the spark.ui.prometheus.enabled configuration property is enabled.
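A hedged usage note: with that property turned on, executor metrics become available in Prometheus format from the web UI. The endpoint path in the comment below reflects the PrometheusResource convention and may vary across Spark versions:

import org.apache.spark.SparkConf

// Sketch: enabling the Prometheus endpoint for executor metrics.
val conf = new SparkConf()
  .setAppName("prometheus-ui")
  .set("spark.ui.prometheus.enabled", "true")
// With the UI on its default port, the metrics are then served at
// http://<driver-host>:4040/metrics/executors/prometheus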

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/SparkUI/#uiroot","title":"UIRoot

SparkUI is a UIRoot.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/SparkUI/#review-me","title":"Review Me

SparkUI is created when:

• SparkContext is created (for a live Spark application with the spark.ui.enabled configuration property enabled)

• FsHistoryProvider is requested for the application UI (for a live or completed Spark application)

Figure: Creating SparkUI for Live Spark Application (image: spark-webui-SparkUI.png)

When created (while SparkContext is created for a live Spark application), SparkUI gets the following:

• Live AppStatusStore (with an ElementTrackingStore using an InMemoryStore and an AppStatusListener for a live Spark application)

• Name of the Spark application that is exactly the value of the spark.app.name configuration property

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Empty base path

When started, SparkUI binds to an address that you can control using the SPARK_PUBLIC_DNS environment variable or the spark.driver.host Spark property.

NOTE: With the spark.ui.killEnabled configuration property turned on, SparkUI allows killing jobs and stages from the web UI (subject to SecurityManager.checkModifyPermissions permissions).
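A hedged configuration sketch covering the knobs above (the hostname is an illustrative assumption, not a value from this page):

import org.apache.spark.SparkConf

// SPARK_PUBLIC_DNS is an environment variable, so it is typically exported
// before launching the application (e.g. in spark-env.sh or the shell).
// The remaining knobs can be set on the SparkConf directly.
val conf = new SparkConf()
  .setAppName("ui-address-demo")
  .set("spark.driver.host", "driver.example.com")  // illustrative hostname
  .set("spark.ui.killEnabled", "false")            // hide the kill links for jobs and stages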

SparkUI gets an AppStatusStore that is then used for the following:

• Creating the UI tabs, i.e. JobsTab, StagesTab, StorageTab, EnvironmentTab

• AbstractApplicationResource is requested for jobsList, oneJob, executorList, allExecutorList, rddList, rddData, environmentInfo

• StagesResource is requested for stageList, stageData, oneAttemptData, taskSummary, taskList

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUI is requested for the current <>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Creating Spark SQL's SQLTab (when SQLHistoryServerPlugin is requested to setupUI)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Spark Streaming's BatchPage is created

SparkUI's Internal Properties (e.g. Registries, Counters and Flags):

• appId: the unique identifier of the Spark application (set using setAppId when SparkContext is created)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/SparkUI/#tip","title":"[TIP]","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Enable INFO logging level for org.apache.spark.ui.SparkUI logger to see what happens inside.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Add the following line to conf/log4j.properties:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          log4j.logger.org.apache.spark.ui.SparkUI=INFO\n
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUI/#refer-to-spark-loggingmdlogging","title":"Refer to spark-logging.md[Logging].","text":"

Assigning Unique Identifier of Spark Application (setAppId Method)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUI/#source-scala","title":"[source, scala]","text":""},{"location":"webui/SparkUI/#setappidid-string-unit","title":"setAppId(id: String): Unit

setAppId sets the internal appId registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          setAppId is used when SparkContext is created.
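As a rough, paraphrased sketch of the call-site shape (ui and applicationId are assumed to be the optional SparkUI and the application ID available inside SparkContext; this is not the literal code):

// Once the cluster manager has assigned an application ID,
// SparkContext pushes it to the web UI (when the UI is enabled).
ui.foreach(_.setAppId(applicationId))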

== [[stop]] Stopping SparkUI -- stop Method

[source, scala]
----
stop(): Unit
----

stop stops the HTTP server and prints the following INFO message to the logs:

----
INFO SparkUI: Stopped Spark web UI at [appUIAddress]
----

NOTE: appUIAddress in the above INFO message is the result of the <> method.
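A quick way to observe this (assuming a local Spark installation and the default logging setup) is to raise the log level to INFO before stopping the SparkContext:

[source, scala]
----
// Stopping the SparkContext also stops the web UI; with INFO logging enabled
// the "Stopped Spark web UI at ..." message should show up in the logs.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("ui-stop-demo"))
sc.setLogLevel("INFO")
sc.stop()
----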

== [[appUIAddress]] appUIAddress Method

[source, scala]
----
appUIAddress: String
----

appUIAddress returns the entire URL of a Spark application's web UI, including the http:// scheme.

Internally, appUIAddress uses <>.
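The scheme-qualified address is also exposed to user code as SparkContext.uiWebUrl, which is a convenient way to see the value without reaching into SparkUI itself (master, application name and the printed address below are just examples):

[source, scala]
----
// The web UI address (including the http:// scheme) as seen from user code.
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("ui-address-demo"))
println(sc.uiWebUrl)  // e.g. Some(http://192.168.1.4:4040)
sc.stop()
----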

== [[createLiveUI]] createLiveUI Method

[source, scala]
----
createLiveUI(
  sc: SparkContext,
  conf: SparkConf,
  listenerBus: SparkListenerBus,
  jobProgressListener: JobProgressListener,
  securityManager: SecurityManager,
  appName: String,
  startTime: Long): SparkUI
----

createLiveUI creates a SparkUI for a live running Spark application.

Internally, createLiveUI simply forwards the call to <>.

createLiveUI is used when SparkContext is created.

== [[createHistoryUI]] createHistoryUI Method

CAUTION: FIXME

== [[appUIHostPort]] appUIHostPort Method

[source, scala]
----
appUIHostPort: String
----

appUIHostPort returns the address of a Spark application's web UI as the public hostname and port, excluding the scheme.

NOTE: <> uses appUIHostPort and adds the http:// scheme.
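The relationship between the two methods can be sketched as follows (a simplified stand-in, not the actual SparkUI sources): the full address is the host-and-port form with the http:// scheme prepended.

[source, scala]
----
// A simplified sketch of the relationship (not the actual SparkUI code):
// prepending the http:// scheme to the host-and-port form gives the full UI address.
def appUIAddress(appUIHostPort: String): String = s"http://$appUIHostPort"

println(appUIAddress("192.168.1.4:4040"))  // http://192.168.1.4:4040
----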

== [[getAppName]] getAppName Method

[source, scala]
----
getAppName: String
----

getAppName returns the name of the Spark application (of a SparkUI instance).

NOTE: getAppName is used when...FIXME

== [[create]] Creating SparkUI Instance -- create Factory Method

[source, scala]
----
create(
  sc: Option[SparkContext],
  store: AppStatusStore,
  conf: SparkConf,
  securityManager: SecurityManager,
  appName: String,
  basePath: String = "",
  startTime: Long,
  appSparkVersion: String = org.apache.spark.SPARK_VERSION): SparkUI
----

create creates a SparkUI backed by a core:AppStatusStore.md[].

Internally, create simply creates a new <> (with the predefined Spark version).

create is used when:

• SparkContext is created
• FsHistoryProvider is requested to spark-history-server:FsHistoryProvider.md#getAppUI[getAppUI] (for a Spark application that already finished)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/SparkUI/#creating-instance_1","title":"Creating Instance

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkUI takes the following when created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[store]] core:AppStatusStore.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[sc]] SparkContext.md[]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[conf]] SparkConf.md[SparkConf]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[securityManager]] SecurityManager
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[appName]] Application name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[basePath]] basePath
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[startTime]] Start time
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • [[appSparkVersion]] appSparkVersion

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkUI initializes the <> and <>.

== [[initialize]] Attaching Tabs and Context Handlers -- initialize Method

[source, scala]
----
initialize(): Unit
----

NOTE: initialize is part of the spark-webui-WebUI.md#initialize[WebUI Contract] to initialize web components.

initialize creates and <> the following tabs (with the reference to the SparkUI and its <>):

. spark-webui-StagesTab.md[StagesTab]
. spark-webui-StorageTab.md[StorageTab]
. spark-webui-EnvironmentTab.md[EnvironmentTab]
. spark-webui-ExecutorsTab.md[ExecutorsTab]

In the end, initialize creates and spark-webui-WebUI.md#attachHandler[attaches] the following ServletContextHandlers:

. spark-webui-JettyUtils.md#createStaticHandler[Creates a static handler] for serving files from a static directory, i.e. /static, to serve static files from the org/apache/spark/ui/static directory (on the CLASSPATH)

. spark-api-ApiRootResource.md#getServletHandler[Creates the /api/* context handler] for the spark-api.md[Status REST API] (see the example after this list)

. spark-webui-JettyUtils.md#createRedirectHandler[Creates a redirect handler] to redirect /jobs/job/kill to /jobs/ and request the JobsTab to execute handleKillRequest before redirection

. spark-webui-JettyUtils.md#createRedirectHandler[Creates a redirect handler] to redirect /stages/stage/kill to /stages/ and request the StagesTab to execute spark-webui-StagesTab.md#handleKillRequest[handleKillRequest] before redirection
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/SparkUITab/","title":"SparkUITab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          SparkUITab\u00a0is an extension of the WebUITab abstraction for UI tabs with the application name and Spark version.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUITab/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • EnvironmentTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ExecutorsTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • JobsTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • StagesTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • StorageTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/SparkUITab/#creating-instance","title":"Creating Instance","text":"

SparkUITab takes the following to be created:

• Parent SparkUI
• URL Prefix

Abstract Class

SparkUITab is an abstract class and cannot be created directly. It is only created indirectly through the concrete SparkUITabs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/SparkUITab/#application-name","title":"Application Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            appName: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            appName requests the parent SparkUI for the appName.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/SparkUITab/#spark-version","title":"Spark Version
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            appSparkVersion: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            appSparkVersion requests the parent SparkUI for the appSparkVersion.
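Both properties simply delegate to the parent SparkUI. The following is a minimal sketch of that delegation, using simplified stand-ins for SparkUI and WebUITab; the class shapes below are illustrative and not Spark's exact class hierarchy.

// Illustrative sketch only: simplified stand-ins for SparkUI and WebUITab.
class SparkUI(val appName: String, val appSparkVersion: String)
abstract class WebUITab(val prefix: String)
abstract class SparkUITab(parent: SparkUI, prefix: String) extends WebUITab(prefix) {
  // Both properties delegate to the parent SparkUI
  def appName: String = parent.appName
  def appSparkVersion: String = parent.appSparkVersion
}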

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ","text":""},{"location":"webui/StagePage/","title":"StagePage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            StagePage is a WebUIPage of StagesTab.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "},{"location":"webui/StagePage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            StagePage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Parent StagesTab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • AppStatusStore"},{"location":"webui/StagePage/#url-prefix","title":"URL Prefix

StagePage uses stage URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/StagePage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              render\u00a0is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              render requires id and attempt request parameters.

render...FIXME
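The following is a hedged sketch of how a page could read the id and attempt request parameters in its render method. DemoStagePage and its error handling are hypothetical and only illustrate the id/attempt contract, not Spark's actual StagePage code; it assumes the servlet API and the scala-xml module are on the classpath.

// Hypothetical page used only to illustrate reading the id and attempt request parameters.
import javax.servlet.http.HttpServletRequest
import scala.xml.Node

class DemoStagePage {
  def render(request: HttpServletRequest): Seq[Node] = {
    // Both parameters are required; a missing value is reported back to the caller.
    val stageId = Option(request.getParameter("id")).map(_.toInt)
      .getOrElse(throw new IllegalArgumentException("Missing id parameter"))
    val attemptId = Option(request.getParameter("attempt")).map(_.toInt)
      .getOrElse(throw new IllegalArgumentException("Missing attempt parameter"))
    <div>Stage {stageId} (attempt {attemptId})</div>
  }
}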

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/StagePage/#tasks-section","title":"Tasks Section","text":""},{"location":"webui/StagePage/#summary-metrics-for-completed-tasks-in-stage","title":"Summary Metrics for Completed Tasks in Stage

The summary metrics table shows the metrics for the tasks in a given stage that have already finished with SUCCESS status and have metrics available.

The 1st row is Duration which includes the quantiles based on executorRunTime.

The 2nd row is the optional Scheduler Delay which includes the time to ship the task from the scheduler to executors, and the time to send the task result from the executors to the scheduler. It is not enabled by default and you should select the Scheduler Delay checkbox under Show Additional Metrics to include it in the summary table.

The 3rd row is the optional Task Deserialization Time which includes the quantiles based on the executorDeserializeTime task metric. It is not enabled by default and you should select the Task Deserialization Time checkbox under Show Additional Metrics to include it in the summary table.

The 4th row is GC Time which is the time that an executor spent paused for Java garbage collection while the task was running (using the jvmGCTime task metric).

The 5th row is the optional Result Serialization Time which is the time spent serializing the task result on an executor before sending it back to the driver (using the resultSerializationTime task metric). It is not enabled by default and you should select the Result Serialization Time checkbox under Show Additional Metrics to include it in the summary table.

The 6th row is the optional Getting Result Time which is the time that the driver spends fetching task results from workers. It is not enabled by default and you should select the Getting Result Time checkbox under Show Additional Metrics to include it in the summary table.

The 7th row is the optional Peak Execution Memory which is the sum of the peak sizes of the internal data structures created during shuffles, aggregations and joins (using the peakExecutionMemory task metric).

If the stage has an input, the 8th row is Input Size / Records which is the bytes and records read from Hadoop or from Spark storage (using the inputMetrics.bytesRead and inputMetrics.recordsRead task metrics).

If the stage has an output, the 9th row is Output Size / Records which is the bytes and records written to Hadoop or to Spark storage (using the outputMetrics.bytesWritten and outputMetrics.recordsWritten task metrics).

If the stage has shuffle reads, there are three more rows in the table. The first is Shuffle Read Blocked Time, the time that tasks spent blocked waiting for shuffle data to be read from remote machines (using the shuffleReadMetrics.fetchWaitTime task metric). The second is Shuffle Read Size / Records, the total shuffle bytes and records read, including both data read locally and data read from remote executors (using the shuffleReadMetrics.totalBytesRead and shuffleReadMetrics.recordsRead task metrics). The last is Shuffle Remote Reads, the total shuffle bytes read from remote executors, which is a subset of the shuffle read bytes; the remaining shuffle data is read locally (using the shuffleReadMetrics.remoteBytesRead task metric).

If the stage has shuffle writes, the following row is Shuffle Write Size / Records (using the shuffleWriteMetrics.bytesWritten and shuffleWriteMetrics.recordsWritten task metrics).

If the stage has bytes spilled, the following two rows are Shuffle spill (memory) (using the memoryBytesSpilled task metric) and Shuffle spill (disk) (using the diskBytesSpilled task metric).
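The following is a rough sketch of how the quantile columns of a summary row (e.g. Duration over executorRunTime) could be derived from the run times of successfully completed tasks. It assumes the usual min / 25th percentile / median / 75th percentile / max columns and a simple nearest-rank percentile; summaryQuantiles and the sample run times are illustrative, not Spark's implementation.

// Illustrative sketch: nearest-rank quantiles over run times (in ms) of succeeded tasks.
object SummaryMetricsSketch {
  def summaryQuantiles(runTimesMs: Seq[Long]): Seq[Long] = {
    require(runTimesMs.nonEmpty, "no completed tasks with metrics")
    val sorted = runTimesMs.sorted
    // Min, 25th percentile, Median, 75th percentile, Max -- the columns of the summary table
    Seq(0.0, 0.25, 0.5, 0.75, 1.0).map { q =>
      val idx = math.min((q * (sorted.size - 1)).round.toInt, sorted.size - 1)
      sorted(idx)
    }
  }

  def main(args: Array[String]): Unit = {
    val executorRunTimes = Seq(120L, 80L, 200L, 95L, 150L)
    println(summaryQuantiles(executorRunTimes).mkString("Duration quantiles (ms): ", ", ", ""))
  }
}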

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/StagePage/#dag-visualization","title":"DAG Visualization","text":""},{"location":"webui/StagePage/#event-timeline","title":"Event Timeline","text":""},{"location":"webui/StagePage/#stage-task-and-shuffle-stats","title":"Stage Task and Shuffle Stats","text":""},{"location":"webui/StagePage/#aggregated-metrics-by-executor","title":"Aggregated Metrics by Executor

The ExecutorTable shows the following columns:

• Executor ID
• Address
• Task Time
• Total Tasks
• Failed Tasks
• Killed Tasks
• Succeeded Tasks
• (optional) Input Size / Records (only when the stage has an input)
• (optional) Output Size / Records (only when the stage has an output)
• (optional) Shuffle Read Size / Records (only when the stage read bytes for a shuffle)
• (optional) Shuffle Write Size / Records (only when the stage wrote bytes for a shuffle)
• (optional) Shuffle Spill (Memory) (only when the stage spilled memory bytes)
• (optional) Shuffle Spill (Disk) (only when the stage spilled bytes to disk)

It gets the executorSummary from StageUIData (for the stage and stage attempt ID) and creates a row per executor.
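The following is a minimal sketch of the per-executor aggregation behind this table, assuming a simplified ExecutorStageSummaryLike case class and an in-memory map keyed by executor ID; both are hypothetical stand-ins rather than Spark's internal types.

// Hypothetical, simplified stand-in for the per-executor summary the table is built from.
case class ExecutorStageSummaryLike(
  taskTimeMs: Long,
  failedTasks: Int,
  killedTasks: Int,
  succeededTasks: Int)

object ExecutorTableSketch {
  def main(args: Array[String]): Unit = {
    // One entry per executor ID, as if fetched for a (stage, stage attempt) pair.
    val executorSummary = Map(
      "1" -> ExecutorStageSummaryLike(taskTimeMs = 4200, failedTasks = 0, killedTasks = 0, succeededTasks = 12),
      "2" -> ExecutorStageSummaryLike(taskTimeMs = 3900, failedTasks = 1, killedTasks = 0, succeededTasks = 10))

    // One row per executor: Executor ID, Task Time, Total Tasks, Failed/Killed/Succeeded Tasks.
    executorSummary.toSeq.sortBy(_._1).foreach { case (executorId, s) =>
      val totalTasks = s.failedTasks + s.killedTasks + s.succeededTasks
      println(s"$executorId: taskTime=${s.taskTimeMs}ms totalTasks=$totalTasks " +
        s"failed=${s.failedTasks} killed=${s.killedTasks} succeeded=${s.succeededTasks}")
    }
  }
}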

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/StagePage/#accumulators","title":"Accumulators

The Stage page displays a table with named accumulators (only if any exist). It contains the name and value of each accumulator.
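Only named accumulators appear in this table. Below is a minimal example of registering one with SparkContext.longAccumulator; the accumulator name and the counting logic are arbitrary illustrations.

// A named LongAccumulator; the name is what appears in the Accumulators table on the Stage page.
import org.apache.spark.{SparkConf, SparkContext}

object NamedAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NamedAccumulatorExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Unnamed accumulators are not listed on the Stage page.
    val processed = sc.longAccumulator("records processed")

    sc.parallelize(1 to 1000).foreach(_ => processed.add(1))
    println(s"Accumulator '${processed.name.getOrElse("")}' = ${processed.value}")

    sc.stop()
  }
}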

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ","text":""},{"location":"webui/StagesTab/","title":"StagesTab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              StagesTab is a SparkUITab with stages URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "},{"location":"webui/StagesTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              StagesTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • AppStatusStore

StagesTab is created when:

• SparkUI is requested to initialize
"},{"location":"webui/StagesTab/#pages","title":"Pages","text":"

When created, StagesTab attaches the following pages (see the sketch after this list):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • AllStagesPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • StagePage (with the AppStatusStore)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • PoolPage
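
The pages are attached in the constructor using the attachPage helper of SparkUITab. The following is a simplified sketch of that wiring (illustrative only, not the verbatim Spark source; all classes involved are Spark-internal, i.e. private[spark]):

```scala
// Simplified sketch of how StagesTab registers its pages (illustrative only).
private[ui] class StagesTab(parent: SparkUI, store: AppStatusStore)
  extends SparkUITab(parent, "stages") {

  attachPage(new AllStagesPage(this))     // the "Stages for All Jobs" page
  attachPage(new StagePage(this, store))  // tasks and statistics of a single stage
  attachPage(new PoolPage(this))          // FAIR scheduler pool details
}
```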
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/StagesTab/#introduction","title":"Introduction","text":"

The Stages tab shows the current state of all stages of all jobs in a Spark application. It comes with two optional pages: one with the tasks and statistics of a stage (when a stage is selected) and one with pool details (when the application runs in FAIR scheduling mode).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The title of the tab is Stages for All Jobs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                With no jobs submitted yet (and hence no stages to display), the page shows nothing but the title.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The Stages page shows the stages in a Spark application per state in their respective sections:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Active Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Pending Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Completed Stages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Failed Stages

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The state sections are only displayed when there are stages in a given state.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                In FAIR scheduling mode you have access to the table showing the scheduler pools.
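
As an example (a minimal, self-contained sketch; the application and pool names are made up), FAIR scheduling is enabled with the spark.scheduler.mode configuration property, and a job can be submitted to a named pool with the spark.scheduler.pool local property, which makes the pools table show up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable FAIR scheduling so the Stages tab also shows the scheduler pools table.
val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")   // made-up application name
  .setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR")
val sc = SparkContext.getOrCreate(conf)

// Optionally run jobs in a named pool ("myPool" is made up); the pool then
// appears in the pools table with a link to its details page.
sc.setLocalProperty("spark.scheduler.pool", "myPool")
sc.parallelize(1 to 100).count()
```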

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/StoragePage/","title":"StoragePage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                StoragePage is a WebUIPage of StorageTab.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "},{"location":"webui/StoragePage/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                StoragePage takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Parent SparkUITab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • AppStatusStore

StoragePage is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • StorageTab is created
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/StoragePage/#rendering-page","title":"Rendering Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  render(\n  request: HttpServletRequest): Seq[Node]\n

render is part of the WebUIPage abstraction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  render renders a Storage page with the RDDs and streaming blocks (from the AppStatusStore).
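
For illustration (a minimal sketch assuming an active SparkContext sc and a made-up RDD name), an RDD appears on the Storage page once it has been persisted and an action has cached some of its partitions:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an active SparkContext `sc` (e.g. spark.sparkContext).
// Name and persist the RDD, then run an action so partitions get cached and
// the RDD shows up on the Storage page.
val rdd = sc.parallelize(1 to 1000000)
  .setName("demo-rdd")                   // made-up name, shown as the RDD Name
  .persist(StorageLevel.MEMORY_ONLY)
rdd.count()
```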

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ","text":""},{"location":"webui/StoragePage/#rdd-tables-headers","title":"RDD Table's Headers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StoragePage uses the following headers and tooltips for the RDD table.

Header | Tooltip
------ | -------
ID | 
RDD Name | Name of the persisted RDD
Storage Level | StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) and replication factor of the persisted RDD
Cached Partitions | Number of partitions cached
Fraction Cached | Fraction of total partitions cached
Size in Memory | Total size of partitions in memory
Size on Disk | Total size of partitions on the disk
","text":""},{"location":"webui/StorageTab/","title":"StorageTab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StorageTab is a SparkUITab with storage URL prefix.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "},{"location":"webui/StorageTab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  StorageTab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Parent SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • AppStatusStore

StorageTab is created when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkUI is requested to initialize
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/StorageTab/#pages","title":"Pages","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    When created, StorageTab attaches the following pages (with a reference to itself and the AppStatusStore):

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • StoragePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • RDDPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/UIUtils/","title":"UIUtils","text":"

UIUtils is a utility object with HTML rendering helpers for web UI pages (incl. the common Spark page header).

headerSparkPage Method

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/UIUtils/#source-scala","title":"[source, scala]","text":"

headerSparkPage(\n  request: HttpServletRequest,\n  title: String,\n  content: => Seq[Node],\n  activeTab: SparkUITab,\n  refreshInterval: Option[Int] = None,\n  helpText: Option[String] = None,\n  showVisualization: Boolean = false,\n  useDataTables: Boolean = false): Seq[Node]\n

headerSparkPage renders an HTML page with the standard Spark UI header (the navigation bar with the pages of the active SparkUITab) and the given title around the given content.

NOTE: headerSparkPage is used by web UI pages when they are requested to render, to wrap their content in the common page layout.
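
For illustration (a sketch only; HelloPage is a made-up class, the UI classes are private[spark], and the servlet import may differ across Spark versions), a web UI page typically calls headerSparkPage from its render method to get the standard layout around its content:

```scala
import javax.servlet.http.HttpServletRequest
import scala.xml.Node
import org.apache.spark.ui.{SparkUITab, UIUtils, WebUIPage}

// HelloPage is a made-up page used only to show where headerSparkPage fits in.
class HelloPage(parent: SparkUITab) extends WebUIPage("hello") {
  override def render(request: HttpServletRequest): Seq[Node] = {
    val content = <p>Hello from a custom web UI page</p>
    // headerSparkPage adds the common Spark header (navigation bar with the
    // parent tab's pages) and the page title around the given content.
    UIUtils.headerSparkPage(request, "Hello", content, parent)
  }
}
```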

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/WebUI/","title":"WebUI","text":"

WebUI is an abstraction of web UIs that manage tabs, pages and attached handlers, and bind to an HTTP server.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/WebUI/#contract","title":"Contract","text":""},{"location":"webui/WebUI/#initializing","title":"Initializing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    initialize(): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Initializes components of the UI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Used by the implementations themselves.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Note

initialize does not add anything to the Scala type hierarchy; it merely gives all WebUIs a common method name. In other words, initialize does not participate in any design pattern or type hierarchy, and is part of the contract only as a naming convention.
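
For illustration (a simplified sketch, not the verbatim Spark source; MyUI is a made-up class, and WebUI, SecurityManager and SSLOptions are private[spark], so this is illustrative only), an implementation's initialize typically just attaches its tabs and handlers and is called from the constructor:

```scala
import org.apache.spark.{SecurityManager, SparkConf, SSLOptions}
import org.apache.spark.ui.WebUI

// Made-up WebUI implementation showing the initialize pattern (illustrative only).
private[spark] class MyUI(
    securityManager: SecurityManager,
    sslOptions: SSLOptions,
    port: Int,
    conf: SparkConf)
  extends WebUI(securityManager, sslOptions, port, conf) {

  def initialize(): Unit = {
    // attachTab(new SomeTab(this))  // register the UI's tabs (and their pages)
    addStaticHandler("org/apache/spark/ui/static", "/static")  // static assets
  }

  initialize()  // implementations call initialize() themselves (see above)
}
```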

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ","text":""},{"location":"webui/WebUI/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • HistoryServer
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MasterWebUI (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • MesosClusterUI (Spark on Mesos)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • SparkUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • WorkerWebUI (Spark Standalone)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "},{"location":"webui/WebUI/#creating-instance","title":"Creating Instance","text":"

WebUI takes the following to be created:

• SecurityManager
• SSLOptions
• Port
• SparkConf
• Base Path (default: empty)
• Name (default: empty)

Abstract Class

WebUI is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUIs.
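
Put together, the parameters above imply a constructor shape along the following lines. This is a sketch only: MyWebUI is a made-up name, the real WebUI is private[spark] (hence the package declaration), and the exact modifiers and default values may differ across Spark versions.

package org.apache.spark.ui.sketch

import org.apache.spark.{SecurityManager, SSLOptions, SparkConf}

// Sketch of the constructor shape implied by the parameter list above.
// MyWebUI is a stand-in name for illustration only.
abstract class MyWebUI(
    securityManager: SecurityManager,
    sslOptions: SSLOptions,
    port: Int,
    conf: SparkConf,
    basePath: String = "",   // Base Path (default: empty)
    name: String = "")       // Name (default: empty)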

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/WebUI/#tabs","title":"Tabs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      WebUI uses tabs registry for WebUITabs (that have been attached).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Tabs can be attached and detached.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUI/#attaching-tab","title":"Attaching Tab
attachTab(
  tab: WebUITab): Unit

attachTab attaches the pages of the given WebUITab (and adds it to the tabs).

Detaching Tab

detachTab(
  tab: WebUITab): Unit

detachTab detaches the pages of the given WebUITab (and removes it from the tabs).
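
attachTab and detachTab boil down to simple bookkeeping on the tabs registry. Below is a minimal, self-contained sketch of that bookkeeping; SimpleUI, SimpleTab and SimplePage are made-up stand-ins for WebUI, WebUITab and WebUIPage, and the handler wiring behind attaching a page is elided.

import scala.collection.mutable.ArrayBuffer

// Self-contained model of the tabs bookkeeping described above; the names are
// illustrative stand-ins, not Spark's classes.
final case class SimplePage(prefix: String)
final case class SimpleTab(prefix: String, pages: Seq[SimplePage])

class SimpleUI {
  private val tabs = ArrayBuffer.empty[SimpleTab]        // the "tabs registry"

  private def attachPage(page: SimplePage): Unit = ()    // handler wiring elided
  private def detachPage(page: SimplePage): Unit = ()

  def attachTab(tab: SimpleTab): Unit = {
    tab.pages.foreach(attachPage)   // attach every page of the tab
    tabs += tab                     // and record the tab itself
  }

  def detachTab(tab: SimpleTab): Unit = {
    tab.pages.foreach(detachPage)   // detach every page of the tab
    tabs -= tab                     // and forget the tab
  }

  def getTabs: Seq[SimpleTab] = tabs.toSeq
}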

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUI/#pages","title":"Pages

WebUI uses pageToHandlers registry for WebUIPages and their associated ServletContextHandlers.

Pages can be attached and detached (see the sketch after Detaching Page below).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUI/#attaching-page","title":"Attaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      attachPage(\n  page: WebUIPage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      attachPage...FIXME

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      attachPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • WebUI is requested to attach a tab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • others
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUI/#detaching-page","title":"Detaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      detachPage(\n  page: WebUIPage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      detachPage removes the given WebUIPage from the UI (the pageToHandlers registry) with all of the handlers.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      detachPage is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • WebUI is requested to detach a tab
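
The attachPage details are left as FIXME above; the sketch below models only the pageToHandlers bookkeeping that the two descriptions imply. Everything in it is an assumption for illustration (in particular DemoPage, Handler, makeHandlers and the /json path), not Spark's actual servlet machinery.

import scala.collection.mutable

// Illustrative model of the pageToHandlers registry; the real attachPage also
// registers the handlers with the underlying Jetty server.
final case class DemoPage(prefix: String)
final case class Handler(path: String)

class PageRegistry {
  private val pageToHandlers =
    mutable.HashMap.empty[DemoPage, mutable.ArrayBuffer[Handler]]

  // Assumption: a page is served at its prefix (HTML) and at prefix + "/json" (JSON).
  private def makeHandlers(page: DemoPage): Seq[Handler] =
    Seq(Handler("/" + page.prefix), Handler("/" + page.prefix + "/json"))

  def attachPage(page: DemoPage): Unit = {
    val handlers = pageToHandlers.getOrElseUpdate(page, mutable.ArrayBuffer.empty[Handler])
    handlers ++= makeHandlers(page)   // remember the handlers created for this page
  }

  def detachPage(page: DemoPage): Unit = {
    // Drop the page together with all of its handlers; the real WebUI would also
    // detach each handler from the server.
    pageToHandlers.remove(page)
  }
}
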
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUI/#logging","title":"Logging

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Since WebUI is an abstract class, logging is configured using the logger of the implementations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUIPage/","title":"WebUIPage","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      WebUIPage is an abstraction of pages (of a WebUI) that can be rendered to HTML and JSON.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/WebUIPage/#contract","title":"Contract","text":""},{"location":"webui/WebUIPage/#rendering-page-to-html","title":"Rendering Page (to HTML)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      render(\n  request: HttpServletRequest): Seq[Node]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • WebUI is requested to attach a page (to handle the URL)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ","text":""},{"location":"webui/WebUIPage/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AllExecutionsPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AllJobsPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • AllStagesPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ApplicationPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • BatchPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • DriverPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • EnvironmentPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutionPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorsPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ExecutorThreadDumpPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • HistoryPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • JobPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • LogPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MasterPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MesosClusterPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • PoolPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • RDDPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StagePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StoragePage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StreamingPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StreamingQueryPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • StreamingQueryStatisticsPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ThriftServerPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ThriftServerSessionPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • WorkerPage
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "},{"location":"webui/WebUIPage/#creating-instance","title":"Creating Instance","text":"

WebUIPage takes the following to be created:

• URL Prefix

Abstract Class

WebUIPage is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUIPages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUIPage/#rendering-page-to-json","title":"Rendering Page to JSON
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        renderJson(\n  request: HttpServletRequest): JValue\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        renderJson returns a JNothing by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        renderJson\u00a0is used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • WebUI is requested to attach a page (and handle the /json URL)
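
The following is a minimal sketch of a concrete page that overrides renderJson. HelloPage is a made-up name; since WebUIPage is private[spark], such code would have to live in Spark's own org.apache.spark.ui package, and the javax.servlet import assumes Spark 3.x.

```scala
package org.apache.spark.ui

import javax.servlet.http.HttpServletRequest

import scala.xml.Node

import org.json4s.JValue
import org.json4s.JsonDSL._

// Hypothetical page mounted under the "hello" URL prefix
class HelloPage extends WebUIPage("hello") {

  // Required: the HTML body of the page
  override def render(request: HttpServletRequest): Seq[Node] = {
    <p>Hello from a custom WebUIPage</p>
  }

  // Optional: what the page's /json URL returns (JNothing by default)
  override def renderJson(request: HttpServletRequest): JValue =
    ("page" -> "hello") ~ ("status" -> "ok")
}
```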
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ","text":""},{"location":"webui/WebUITab/","title":"WebUITab","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        WebUITab is an abstraction of UI tabs with a name and pages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUITab/#implementations","title":"Implementations","text":"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • SparkUITab
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "},{"location":"webui/WebUITab/#creating-instance","title":"Creating Instance","text":"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        WebUITab takes the following to be created:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • WebUI
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Prefix Abstract Class

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          WebUITab\u00a0is an abstract class and cannot be created directly. It is created indirectly for the concrete WebUITabs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "},{"location":"webui/WebUITab/#name","title":"Name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          name: String\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          WebUITab has a name that is the prefix capitalized by default.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUITab/#pages","title":"Pages
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          pages: ArrayBuffer[WebUIPage]\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          WebUITab has WebUIPages.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/WebUITab/#attaching-page","title":"Attaching Page
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          attachPage(\n  page: WebUIPage): Unit\n

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          attachPage registers the WebUIPage (in the pages registry).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          attachPage adds the prefix of this WebUITab before the prefix of the given WebUIPage:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [prefix]/[page.prefix]\n
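
To make the prefix composition concrete, here is a sketch of a made-up tab (GreetingsTab) that attaches the HelloPage from the earlier sketch, again assuming code in Spark's own package space since WebUITab is private[spark].

```scala
package org.apache.spark.ui

// Hypothetical tab; name is "Greetings" (the prefix capitalized)
class GreetingsTab(parent: WebUI) extends WebUITab(parent, "greetings") {
  // attachPage rewrites HelloPage's prefix from "hello" to "greetings/hello"
  // (i.e. [prefix]/[page.prefix]) and adds the page to the pages registry
  attachPage(new HelloPage)
}
```

Once the tab is attached to its parent WebUI (WebUI.attachTab), the page would be expected under the /greetings/hello URL, with /greetings/hello/json serving renderJson.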
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/configuration-properties/","title":"web UI Configuration Properties","text":""},{"location":"webui/configuration-properties/#sparkuicustomexecutorlogurl","title":"spark.ui.custom.executor.log.url

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Specifies custom spark executor log url for supporting external log service instead of using cluster managers' application log urls in the Spark UI. Spark will support some path variables via patterns which can vary on cluster manager. Please check the documentation for your cluster manager to see which patterns are supported, if any. This configuration replaces original log urls in event log, which will be also effective when accessing the application on history server. The new log urls must be permanent, otherwise you might have dead link for executor log urls.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • DriverEndpoint is created (and initializes an ExecutorLogUrlHandler)
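
As a sketch only: the property can be set like any other Spark configuration property. The {{...}} path variables below are the YARN-style placeholders and the log-service endpoint is made up; consult your cluster manager's documentation for the patterns it actually supports.

```scala
import org.apache.spark.SparkConf

// Hypothetical external log service; {{...}} are YARN-style path-variable patterns
val conf = new SparkConf()
  .set(
    "spark.ui.custom.executor.log.url",
    "https://logservice.example.com/{{NM_HOST}}/{{CONTAINER_ID}}/{{USER}}/{{FILE_NAME}}")
```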
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/configuration-properties/#sparkuienabled","title":"spark.ui.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Controls whether the web UI is started for the Spark application

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: true

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/configuration-properties/#sparkuiport","title":"spark.ui.port

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The port the web UI of a Spark application binds to

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: 4040

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If multiple SparkContexts attempt to run on the same host (as different Spark applications), they will bind to successive ports beginning with spark.ui.port (until spark.port.maxRetries).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkUI utility is used to get the UI port
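
A minimal sketch of pinning the web UI to a non-default port; the app name, master and port value are arbitrary examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ui-port-demo")   // arbitrary example values
  .setMaster("local[*]")
  .set("spark.ui.port", "4050")

val sc = SparkContext.getOrCreate(conf)
// The web UI would then be expected at http://localhost:4050
// (subject to spark.port.maxRetries if the port is already taken)
```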
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/configuration-properties/#sparkuiprometheusenabled","title":"spark.ui.prometheus.enabled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          internal Expose executor metrics at /metrics/executors/prometheus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Default: false

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used when:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • SparkUI is requested to initialize
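
A sketch of enabling the endpoint for experimentation; spark.ui.prometheus.enabled is an internal property, so this is not a documented, stable setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("prometheus-demo")  // arbitrary example values
  .setMaster("local[*]")
  .set("spark.ui.prometheus.enabled", "true")

val sc = SparkContext.getOrCreate(conf)
// Executor metrics would then be expected in Prometheus format at the web UI, e.g.
//   http://localhost:4040/metrics/executors/prometheus
```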
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""},{"location":"webui/configuration-properties/#review-me","title":"Review Me

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [[properties]] .web UI Configuration Properties [cols=\"1,1,2\",options=\"header\",width=\"100%\"] |=== | Name | Default Value | Description

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          [[spark.ui.allowFramingFrom]] spark.ui.allowFramingFrom Defines the URL to use in ALLOW-FROM in X-Frame-Options header (as described in http://tools.ietf.org/html/rfc7034).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used exclusively when JettyUtils is requested to spark-webui-JettyUtils.md#createServlet[create an HttpServlet].

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[spark.ui.consoleProgress.update.interval]] spark.ui.consoleProgress.update.interval | 200 (ms) | Update interval, i.e. how often to show the progress.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[spark.ui.killEnabled]] spark.ui.killEnabled | true | Enables jobs and stages to be killed from the web UI (true) or not (false).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used exclusively when SparkUI is requested to spark-webui-SparkUI.md#initialize[initialize] (and registers the redirect handlers for /jobs/job/kill and /stages/stage/kill URIs)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[spark.ui.retainedDeadExecutors]] spark.ui.retainedDeadExecutors | 100 |

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[spark.ui.timeline.executors.maximum]] spark.ui.timeline.executors.maximum | 1000 | The maximum number of entries in <> registry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [[spark.ui.timeline.tasks.maximum]] spark.ui.timeline.tasks.maximum | 1000 | |===
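
For illustration only, a couple of the properties above set programmatically, with arbitrary example values:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.killEnabled", "false")                     // hide the kill links
  .set("spark.ui.consoleProgress.update.interval", "1000")  // milliseconds
```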

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ","text":""}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 8f98d8f14350525f6420524dcde20e32357bdcc8..32f81ca76ab9c6b0baa676f29feee2b803e09e44 100644 GIT binary patch delta 16 XcmbOzGf{?JzMF%Cm*L_@b}l{uBPavP delta 16 XcmbOzGf{?JzMF$%<@XC4*}3=tEinaL